Purpose


Our objective is to provide a postcalculus introduction to the discipline of
Statistics that

• Has mathematical integrity and contains some underlying theory.

• Shows students a broad range of applications involving real data.

• Is up to date in its selection of topics.

• Illustrates the importance of statistical software.

• Is accessible to a wide audience, including mathematics and statistics
majors (yes, there are quite a few of the latter these days, thanks to the
proliferation of “big data”), prospective engineers and scientists, and
those business and social science majors interested in the quantitative
aspects of their disciplines.


A number of currently available mathematical statistics texts are heavily 
oriented toward a rigorous mathematical development of probability and 
statistics, with much emphasis on theorems, proofs, and derivations. The 
focus is more on mathematics than on statistical practice. Even when applied 
material is included, the scenarios are often contrived (many examples and 
exercises involving dice, coins, cards, widgets, or a comparison of treatment 
A to treatment B). 

Our exposition is an attempt to provide a reasonable balance between 
mathematical rigor and statistical practice. We believe that showing students 
the applicability of statistics to real-world problems is extremely effective in 
inspiring them to pursue further coursework and even career opportunities in 
statistics. Opportunities for exposure to mathematical foundations will follow 
in due course. In our view, it is more important for students coming out of 
this course to be able to carry out and interpret the results of a two-sample 
t test or simple regression analysis, and appreciate how these are based on 
underlying theory, than to manipulate joint moment generating functions or 
discourse on various modes of convergence. 


Content and Mathematical Level 


The book certainly does include core material in probability (Chap. 2), 
random variables and their distributions (Chaps. 3-5), and sampling theory 
(Chap. 6). But our desire to balance theory with application/data analysis is 
reflected in the way the book starts out, with a chapter on descriptive and 
exploratory statistical techniques rather than an immediate foray into the 
axioms of probability and their consequences. After the distributional 
infrastructure is in place, the remaining statistical chapters cover the basics of 
inference. In addition to introducing core ideas from estimation and 
hypothesis testing (Chaps. 7-10), there is emphasis on checking assumptions 
and examining the data prior to formal analysis. Modern topics such as 
bootstrapping, permutation tests, residual analysis, and logistic regression are 
included. Our treatment of regression, analysis of variance, and categorical 
data analysis (Chaps. 11-13) is definitely more oriented to dealing with real 
data than with theoretical properties of models. We also show many exam- 
ples of output from commonly used statistical software packages, something 
noticeably absent in most other books pitched at this audience and level. 

The challenge for students at this level should lie with mastery of statis- 
tical concepts as well as with mathematical wizardry. Consequently, the 
mathematical prerequisites and demands are reasonably modest. Mathemat- 
ical sophistication and quantitative reasoning ability are certainly important, 
especially as they facilitate dealing with symbolic notation and manipulation. 
Students with a solid grounding in univariate calculus and some exposure to 
multivariate calculus should feel comfortable with what we are asking 
of them. The few sections where matrix algebra appears (transformations in 
Chap. 5 and the matrix approach to regression in the last section of Chap. 12) 
can easily be deemphasized or skipped entirely. Proofs and derivations are 
included where appropriate, but we think it likely that obtaining a conceptual 
understanding of the statistical enterprise will be the major challenge for 
readers. 


Recommended Coverage 


There should be more than enough material in our book for a year-long 
course. Those wanting to emphasize some of the more theoretical aspects 
of the subject (e.g., moment generating functions, conditional expectation, 
transformations, order statistics, sufficiency) should plan to spend corre- 
spondingly less time on inferential methodology in the latter part of the book. 
We have opted not to mark certain sections as optional, preferring instead to 
rely on the experience and tastes of individual instructors in deciding what 
should be presented. We would also like to think that students could be asked 
to read an occasional subsection or even section on their own and then work 
exercises to demonstrate understanding, so that not everything would need to 
be presented in class. Remember that there is never enough time in a course 
of any duration to teach students all that we’d like them to know! 
Revisions for This Edition


• Many of the examples have been updated and/or replaced, especially
those containing real data or references to applications published in
various journals. The same is true of the roughly 1300 exercises in the
book.

• The exposition has been refined and polished throughout to improve
accessibility and eliminate unnecessary material and verbiage. For
example, the categorical data chapter (Chap. 13) has been streamlined by
discarding some of the methodology involving tests when parameters
must be estimated.

• A section on simulation has been added to each of the chapters on
probability, discrete distributions, and continuous distributions.

• The material in the chapter on joint distributions (Chap. 5) has been
reorganized. There is now a separate section on linear combinations and
their properties, and also one on the bivariate normal distribution.

• The material in the chapter on statistics and their sampling distributions
(Chap. 6) has also been reorganized. In particular, there is now a separate
section on the chi-squared, t, and F distributions prior to the one
containing derivations of sampling distributions of statistics based on a
normal random sample.

• The chapters on one-sample confidence intervals (Chap. 8) and hypothesis
tests (Chap. 9) place more emphasis on t procedures and less on
large-sample z procedures. This is also true of inferences based on two
samples in Chap. 10.

• Chap. 9 now contains a subsection on using the bootstrap to test
hypotheses.

• The material on multiple regression models containing quadratic,
interaction, and indicator variables has been separated into its own section.
And there is now a separate expanded section on logistic regression.

• The nonparametric and Bayesian material that previously comprised a
single chapter has been separated into two chapters, and material has been
added to each. For example, there is now a section on nonparametric
inferences about population quantiles.
A Final Thought 


It is our hope that students completing a course taught from this book will 
feel as passionate about the subject of statistics as we still do after so many 
years in the profession. Only teachers can really appreciate how gratifying it 
is to hear from a student after he or she has completed a course that the 
experience had a positive impact and maybe even affected a career choice. 


Los Osos, CA, USA Jay L. Devore 
Normal, IL, USA Kenneth N. Berk 
San Luis Obispo, CA, USA Matthew A. Carlton 
Introduction

Statistical concepts and methods are not only useful but indeed often indispensable in understanding 
the world around us. They provide ways of gaining new insights into the behavior of many phe- 
nomena that you will encounter in your chosen field of specialization. 

The discipline of statistics teaches us how to make intelligent judgments and informed decisions in 
the presence of uncertainty and variation. Without uncertainty or variation, there would be little need 
for statistical methods or statisticians. If the yield of a crop was the same in every field, if all 
individuals reacted the same way to a drug, if everyone gave the same response to an opinion survey, 
and so on, then a single observation would reveal all desired information. 

Section 1.1 establishes some key statistics vocabulary and gives a broad overview of how sta- 
tistical studies are conducted. The rest of this chapter is dedicated to graphical and numerical methods 
for summarizing data. 


1.1 The Language of Statistics 


We are constantly exposed to collections of facts, or data, both in our professional capacities and in 
everyday activities. The discipline of statistics provides methods for organizing and summarizing data 
and for drawing conclusions based on information contained in the data. 

An investigation will typically focus on a well-defined collection of objects constituting a pop- 
ulation of interest. In one study, the population might consist of all multivitamin capsules produced 
by a certain manufacturer in a particular week. Another investigation might involve the population of 
all individuals who received a B.S. in statistics or mathematics during the most recent academic year. 
When desired information is available for all objects in the population, we have what is called a 
census. Constraints on time, money, and other scarce resources usually make a census impractical or 
infeasible. Instead, a subset of the population—a sample—is selected in some prescribed manner. 
Thus we might obtain a sample of pills from a particular production run as a basis for investigating 
whether pills are conforming to manufacturing specifications, or we might select a sample of last 
year’s graduates to obtain feedback about the quality of the curriculum. 

We are usually interested only in certain characteristics of the objects in a population: the amount 
of vitamin C in the pill, the sex of a student, the age of a vehicle, and so on. A characteristic may be 
categorical, such as sex or college major, or it may be quantitative in nature. In the former case, the 
value of the characteristic is a category (e.g., female or economics), whereas in the latter case, the 
value is a number (e.g., age = 5.1 years or vitamin C content = 65 mg). A variable is any characteristic
whose value may change from one object to another in the population. We shall initially
denote variables by lowercase letters from the end of our alphabet. Examples include 


x = brand of computer owned by a student 
y = number of items purchased by a customer at a grocery store 
z = braking distance of an automobile under specified conditions 


Data comes from making observations either on a single variable or simultaneously on two or 
more variables. A univariate data set consists of observations on a single variable. For example, we 
might consider the type of computer, laptop (L) or desktop (D), for ten recent purchases, resulting in 
the categorical data set 


D L L L D L L D L L


The following sample of lifetimes (hours) of cell phone batteries under continuous use is a quanti- 
tative univariate data set: 


10.6 10.1 11.2 9.0 10.8 9.5 8.8 11.5


We have bivariate data when observations are made on each of two variables. Our data set might 
consist of a (height, weight) pair for each basketball player on a team, with the first observation as 
(72, 168), the second as (75, 212), and so on. If a kinesiologist determines the values of x = recu- 
peration time from an injury and y = type of injury, the resulting data set is bivariate with one variable 
quantitative and the other categorical. Multivariate data arises when observations are made on more 
than two variables. For example, a research physician might determine the systolic blood pressure, 
diastolic blood pressure, and serum cholesterol level for each patient participating in a study. Each 
observation would be a triple of numbers, such as (120, 80, 146). In many multivariate data sets, some 
variables are quantitative and others are categorical. Thus the annual automobile issue of Consumer 
Reports gives values of such variables as type of vehicle (small, sporty, compact, midsize, large), city 
fuel efficiency (mpg), highway fuel efficiency (mpg), drive train type (rear wheel, front wheel, four 
wheel), and so on. 


Branches of Statistics 

An investigator who has collected data may wish simply to summarize and describe important 
features of the data. This entails using methods from descriptive statistics. Some of these methods 
are graphical in nature; histograms, boxplots, and scatterplots are primary
examples. Other descriptive methods involve calculation of numerical summary measures, such as 
means, standard deviations, and correlation coefficients. The wide availability of statistical computer 
software packages has made these tasks much easier to carry out than they used to be. Computers are 
much more efficient than human beings at calculation and the creation of pictures (once they have 
received appropriate instructions from the user!). This means that the investigator doesn’t have to 
expend much effort on “grunt work” and will have more time to study the data and extract important 
messages. Throughout this book, we will present output from various packages such as R, SAS, and 
Minitab. 
Example 1.1 Charity is a big business in the USA. The website charitynavigator.com gives infor-
mation on roughly 5500 charitable organizations, and there are many smaller charities that fly below 
the navigator’s radar. Some charities operate very efficiently, with fundraising and administrative 
expenses that are only a small percentage of total expenses, whereas others spend a high percentage of 
what they take in on such activities. Here is data on fundraising expenses, as a percentage of total 
expenditures, for a random sample of 60 charities: 


 6.1  12.6  34.7   1.6  18.8   2.2   3.0   2.2   5.6   3.8
 2.2   3.1   1.3   1.1  14.1   4.0  21.0   6.1   1.3  20.4
 7.5   3.9  10.1   8.1  19.5   5.2  12.0  15.8  10.4   5.2
 6.4  10.8  83.1   3.6   6.2   6.3  16.3  12.7   1.3   0.8
 8.8   5.1   3.7  26.3   6.0  48.0   8.2  11.7   4.2   3.9
15.3  16.6   8.8  12.0   4.7  14.7   6.4  17.0   2.5  16.2


Without any organization, it is difficult to get a sense of the data’s most prominent features: what a 
typical (i.e., representative) value might be, whether values are highly concentrated about a typical 
value or quite dispersed, whether there are any gaps in the data, what fraction of the values are less 
than 20%, and so on. Figure 1.1 shows a histogram. In Section 1.2 we will discuss construction and 
interpretation of this graph. For the moment, we hope you see how it describes the way the per- 
centages are distributed over the range of possible values from 0 to 100. Of the 60 charities, 36 use 
less than 10% on fundraising, and 18 use between 10% and 20%. Thus 54 out of the 60 charities in 
the sample, or 90%, spend less than 20% of money collected on fundraising. How much is too much? 
There is a delicate balance: most charities must spend money to raise money, but then money spent on 
fundraising is not available to help beneficiaries of the charity. Perhaps each individual giver should 
draw his or her own line in the sand. 
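
The counts quoted in Example 1.1 can be checked with a few lines of code. The Python sketch below tallies the sample into the bands discussed in the text. It assumes the 60 values are read as typed in the list `pct` (an assumption wherever a decimal point is unclear in the printed table), and the variable names are ours rather than the book's.

```python
# Fundraising expenses (% of total expenditures) for the 60 sampled charities,
# as read from the table in Example 1.1 (decimal placement is an assumption
# for the few entries that are unclear in the source).
pct = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 4.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

n = len(pct)
under_10 = sum(1 for p in pct if p < 10)            # less than 10%
from_10_to_20 = sum(1 for p in pct if 10 <= p < 20)  # at least 10%, less than 20%
under_20 = under_10 + from_10_to_20
share_under_20 = under_20 / n

print(n, under_10, from_10_to_20, under_20, f"{share_under_20:.0%}")
# prints: 60 36 18 54 90%
```

A tally like this is exactly the bookkeeping that a histogram displays graphically: each band corresponds to one bar.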


Figure 1.1 A histogram for the charity fundraising data of Example 1.1
Having obtained a sample from a population, an investigator would frequently like to use sample
information to draw some type of conclusion (make an inference of some sort) about the population. 
That is, the sample is typically a means to an end rather than an end in itself. Techniques for 
generalizing from a sample to a population in a precise and objective way are gathered within the 
branch of our discipline called inferential statistics. 


Example 1.2 The authors of the article “Fire Safety of Glued-Laminated Timber Beams in Bending” 
(J. of Structural Engr. 2017) conducted an experiment to test the fire resistance properties of wood 
pieces connected at corners by sawtooth-shaped “fingers” along with various types of commercial 
adhesive. The beams were all exposed to the same fire and load conditions. The accompanying data 
on fire resistance time (min) for a sample of timber beams bonded with polyurethane adhesive 
appeared in the article: 


47.0 53.0 52.5 52.0 47.5 56.5 45.0 43.5 48.0 48.0 
41.0 34.0 36.5 49.0 47.5 34.0 34.0 36.0 42.0


Suppose we want an estimate of the true average fire resistance time under these conditions. (Con- 
ceptualizing a population of all such beams with polyurethane bonding under these experimental 
conditions, we are trying to estimate the population mean.) It can be shown that, with a high degree of 
confidence, the population mean fire resistance time is between 41.2 and 48.0 min; this is called a 
confidence interval or an interval estimate. On the other hand, this data can also be used to predict the 
fire resistance time of a single timber beam under these conditions. With a high degree of certainty, 
the fire resistance time of a single such beam will exceed 29.4 min; the number 29.4 is called a lower 
prediction bound.
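
As a preview of the methods behind these numbers, the Python sketch below recomputes the interval and the bound from the 19 observations. It rests on several assumptions that the example does not state: a one-sample t interval at the 95% level, the tabled critical value 2.101 for 18 degrees of freedom (hard-coded here to avoid a dependency on a statistics library), a reading of 36.5 for the third entry in the second row of data, and the same critical value for the prediction bound. These choices happen to reproduce the quoted figures, but the article's own computation may differ.

```python
import math
import statistics

# Fire resistance times (min) for the 19 polyurethane-bonded beams.
times = [
    47.0, 53.0, 52.5, 52.0, 47.5, 56.5, 45.0, 43.5, 48.0, 48.0,
    41.0, 34.0, 36.5, 49.0, 47.5, 34.0, 34.0, 36.0, 42.0,
]

n = len(times)
mean = statistics.mean(times)
s = statistics.stdev(times)          # sample standard deviation (divisor n - 1)

T_CRIT = 2.101  # assumed: t critical value, 18 df, upper-tail area 0.025

# Interval estimate for the population mean: mean +/- t * s / sqrt(n)
half_width = T_CRIT * s / math.sqrt(n)
ci_low, ci_high = mean - half_width, mean + half_width

# Lower prediction bound for one future beam: mean - t * s * sqrt(1 + 1/n)
lower_pred_bound = mean - T_CRIT * s * math.sqrt(1 + 1 / n)

print(f"CI: ({ci_low:.1f}, {ci_high:.1f})  prediction bound: {lower_pred_bound:.1f}")
# prints: CI: (41.2, 48.0)  prediction bound: 29.4
```

The prediction bound sits well below the lower confidence limit because a single beam varies much more than an average of 19 beams does.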


Probability Versus Statistics 

The main focus of this book is on presenting and illustrating methods of inferential statistics that are 
useful in research. The most important types of inferential procedures (point estimation, hypothesis
testing, and estimation by confidence intervals) are introduced in Chapters 7-9 and then used in
more complicated settings in Chapters 10-15. The remainder of this chapter presents methods from 
descriptive statistics that are most used in the development of inference. 

Chapters 2-6 present material from the discipline of probability. This material ultimately forms a
bridge between the descriptive and inferential techniques. Mastery of probability leads to a better 
understanding of how inferential procedures are developed and used, how statistical conclusions can 
be translated into everyday language and interpreted, and when and where pitfalls can occur in 
applying the methods. Probability and statistics both deal with questions involving populations and 
samples, but do so in an “inverse manner” to each other. 

In probability, properties of the population under study are assumed known (e.g., in a numerical 
population, some specified distribution of the population values may be assumed), and questions 
regarding a sample taken from the population are posed and answered. In statistics, characteristics of 
a sample are available to the experimenter, and this information enables the experimenter to draw 
conclusions about the population. The relationship between the two disciplines can be summarized by 
saying that probability reasons from the population to the sample (deductive reasoning), whereas 
inferential statistics reasons from the sample to the population (inductive reasoning). This is illus- 
trated in Figure 1.2. 
[Diagram: arrows labeled “Probability” and “Inferential statistics” running in opposite directions between population and sample]

Figure 1.2 The relationship between probability and inferential statistics


Before we can understand what a particular sample can tell us about the population, we should first 
understand the uncertainty associated with taking a sample from a given population. This is why we 
study probability before statistics. 

As an example of the contrasting focus of probability and inferential statistics, consider drivers’ 
use of seatbelts in automobiles. According to the article “Somehow, Way Too Many Americans Still 
Aren’t Wearing Seatbelts” (www.wired.com, Sept. 2016), data collected by observers from the 
National Highway Traffic Safety Administration indicates that 88.5% of drivers and front seat pas- 
sengers buckle up. But this percentage varies considerably by location. In the 34 states in which a 
driver can be pulled over and cited for nonusage, 91.2% wore their seatbelts in 2015. By contrast, in 
the 15 states where a citation can be given only if a driver is pulled over for another infraction and the 
one state where usage is not mandatory (New Hampshire), usage drops to 78.6%. 

In a probability context, we might assume that 85% of all drivers in a particular metropolitan area 
regularly use seatbelts (an assumption about the population) and then ask, “How likely is it that a 
sample of 100 drivers will include at most 70 who regularly use their seatbelt?” or “How many 
drivers in a sample of size 100 can we expect to regularly use their seatbelt?” On the other hand, in 
inferential statistics, sample information is available, e.g., a sample of 100 drivers from this area 
reveals that 80 regularly use their seatbelts. We might then ask, “Does this provide strong evidence 
for concluding that less than 90% of all drivers in this area are regular seatbelt users?” In this latter 
scenario, sample information will be employed to answer a question about the structure of the entire 
population from which the sample was selected. 
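
The probability questions in the seatbelt illustration can be answered numerically once a model is chosen. The Python sketch below assumes that the number of seatbelt users among 100 sampled drivers follows a binomial distribution (drivers behaving independently, each with the same chance of being a regular user). That model is our assumption for illustration and is developed later in the book. The helper `binom_cdf` is a hypothetical name, not a library function.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n = 100

# Probability setting: the population proportion is assumed known (0.85).
p_at_most_70 = binom_cdf(70, n, 0.85)   # chance of at most 70 users in the sample
expected_users = n * 0.85               # expected number of users in the sample

# Inferential flavor: the sample shows 80 users. If 90% of all drivers were
# regular users, how unusual would a count of 80 or fewer be?
p_80_or_fewer_if_90 = binom_cdf(80, n, 0.90)

print(f"P(at most 70 | p = 0.85) = {p_at_most_70:.6f}")
print(f"expected users           = {expected_users:.0f}")
print(f"P(at most 80 | p = 0.90) = {p_80_or_fewer_if_90:.6f}")
```

The first two results are statements about samples given a known population. The third runs in the inverse direction: a small value suggests the sample is hard to reconcile with a population proportion as high as 90%.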

Next, consider a study involving a sample of 25 patients to investigate the efficacy of a new 
minimally invasive method for rotator cuff surgery. The amount of time that each individual sub- 
sequently spends in physical therapy is then determined. The resulting sample of 25 PT times is from 
a population that does not actually exist. Instead it is convenient to think of the population as 
consisting of all possible times that might be observed under similar experimental conditions. Such a 
population is referred to as a conceptual or hypothetical population. There are a number of situations 
in which we fit questions into the framework of inferential statistics by conceptualizing a population. 


Collecting Data 
Statistics deals not only with the organization and analysis of data once it has been collected but also 
with the development of techniques for collecting that data. If data is not properly collected, an 
investigator might not be able to answer the questions under consideration with a reasonable degree 
of confidence. One common problem is that the target population—the one about which conclusions 
are to be drawn—may be different from the population actually sampled. In that case, an investigator 
must be very cautious about generalizing from the circumstances under which data has been gathered. 
For example, advertisers would like various kinds of information about the television-viewing 
habits of potential customers. The most systematic information of this sort comes from placing 
monitoring devices in a small number of homes across the USA. It has been conjectured that 
placement of such devices in and of itself alters viewing behavior, so that characteristics of the sample 
may be different from those of the target population. As another example, a sample of five engines 
with a new design may be experimentally manufactured and tested to investigate efficiency. These
five could be viewed as a sample from the conceptual population of all prototypes that could be 
manufactured under similar conditions, but not necessarily as representative of all units manufactured 
once regular production gets under way. Methods for using sample information to draw conclusions 
about future production units may be problematic. Similarly, a new drug may be tried on patients who 
arrive at a clinic (i.e., a voluntary sample), but there may be some question about how typical these 
patients are. They may not be representative of patients elsewhere or patients at the same clinic next 
year. 

When data collection entails selecting individuals or objects from a list, the simplest method for 
ensuring a representative selection is to take a simple random sample. This is one for which any 
particular subset of the specified size (e.g., a sample of size 100) has the same chance of being 
selected. For example, if the list consists of 1,000,000 serial numbers, the numbers 1, 2, ..., up to 
1,000,000 could be placed on identical slips of paper. After placing these slips in a box and thor- 
oughly mixing, slips could be drawn one by one until the requisite sample size has been obtained. 
Alternatively (and much to be preferred), a computer’s random number generator could be employed 
to generate 100 distinct numbers between 1 and 1,000,000.
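
A minimal sketch of the computer-based approach, using Python's standard `random` module: `sample` draws without replacement, so the 100 serial numbers are distinct and every subset of size 100 is equally likely. The seed value is arbitrary and is fixed only so that the illustration is reproducible.

```python
import random

rng = random.Random(2024)   # arbitrary seed, used only for reproducibility

population_size = 1_000_000
sample_size = 100

# Draw 100 distinct serial numbers from 1, 2, ..., 1,000,000 without replacement.
serials = rng.sample(range(1, population_size + 1), sample_size)

print(sorted(serials)[:5])  # a peek at the five smallest selected serial numbers
```

Passing a `range` rather than a list means the million serial numbers never have to be stored in memory.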

Sometimes alternative sampling methods can be used to make the selection process easier, to 
obtain extra information, or to increase the degree of precision in conclusions. One such method, 
stratified sampling, entails separating the population units into nonoverlapping groups and taking a 
sample from each one. For example, a cell phone manufacturer might want information about 
customer satisfaction for units produced during the previous year. If three different models were 
manufactured and sold, a separate sample could be selected from each of the three corresponding 
strata. This would result in information on all three models and ensure that no one model was over- or
underrepresented in the entire sample. 
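
The cell phone illustration might be sketched as follows in Python. The model names, customer IDs, stratum sizes, and the choice of 50 customers per stratum are all hypothetical. The point is only that a separate simple random sample is taken within each stratum.

```python
import random

rng = random.Random(7)   # arbitrary seed, used only for reproducibility

# Hypothetical strata: customer IDs grouped by the phone model purchased.
strata = {
    "model_A": [f"A{i:05d}" for i in range(40_000)],
    "model_B": [f"B{i:05d}" for i in range(25_000)],
    "model_C": [f"C{i:05d}" for i in range(5_000)],
}

per_stratum = 50  # assumed sample size within each stratum

# Take a separate simple random sample from each stratum.
stratified_sample = {
    model: rng.sample(customers, per_stratum)
    for model, customers in strata.items()
}

for model, chosen in stratified_sample.items():
    print(model, len(chosen), chosen[:3])
```

With a single simple random sample of 150 customers from the combined list, the smallest stratum (`model_C`) could easily end up with only a handful of customers. Sampling within strata guarantees 50 from each model.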

Frequently a “convenience” sample is obtained by selecting individuals or objects without sys- 
tematic randomization. As an example, a collection of bricks may be stacked in such a way that it is 
extremely difficult for those in the center to be selected. If the bricks on the top and sides of the stack 
were somehow different from the others, the resulting sample data would not be representative of the 
population. Often an investigator will assume that such a convenience sample approximates a random 
sample, in which case a statistician’s repertoire of inferential methods can be used; however, this is a 
judgment call. 

Researchers may also collect data by carrying out some sort of designed experiment. This may 
involve deciding how to allocate several different treatments (such as fertilizers or drugs) to various 
experimental units (plots of land or patients). Alternatively, an investigator may systematically vary 
the levels or categories of certain factors (e.g., amount of fertilizer or dose of a drug) and observe the 
effect on some response (such as corn yield or blood pressure). 
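
One common way to allocate treatments to experimental units is by randomization. The Python sketch below shuffles a list of units and deals them out evenly among the treatments. The twelve plots and three fertilizer labels are hypothetical, and equal group sizes are our choice for the illustration.

```python
import random

rng = random.Random(11)   # arbitrary seed, used only for reproducibility

plots = [f"plot_{i}" for i in range(1, 13)]              # 12 hypothetical units
treatments = ["fertilizer_A", "fertilizer_B", "fertilizer_C"]

shuffled = plots[:]       # copy so the original list is left untouched
rng.shuffle(shuffled)

# Deal the shuffled plots out in turn, so each treatment gets the same number.
assignment = {
    t: sorted(shuffled[i::len(treatments)])
    for i, t in enumerate(treatments)
}

for t, units in assignment.items():
    print(t, units)
```

Because chance alone decides which plot receives which fertilizer, systematic differences among plots (soil quality, drainage, and so on) tend to balance out across the treatment groups.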


Example 1.3 Neonicotinoid insecticides (NNIs) are popular in agricultural use, especially for 
growing corn, but scientists are increasingly concerned about their effects on bee populations. An 
article in Science (June 30, 2017) described the results of a two-year study in which scientists 
randomly assigned some bee colonies to be exposed to “field-realistic” levels and durations of NNIs, 
while other colonies did not have NNI exposure. The researchers found that bees in the colonies 
exposed to NNIs had a 23% reduced life span, on average, compared to those in nonexposed colonies. 
One possible explanation for this result is chance variation—i.e., that NNIs really don’t affect bee 
colony health and the observed difference is just “random noise,” in the same way that tossing two 
identical coins 10 times each will usually produce different numbers of heads. However, in this case, 
inferential methods discussed in this textbook (and in the original article) suggest that chance 
variation by itself cannot adequately explain the magnitude of the observed difference, indicating that
NNIs may very well be responsible for the reduced average life span.
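
The coin-tossing analogy in Example 1.3 can be made concrete. Assuming both coins are fair and all tosses are independent (our assumptions for the sketch), the number of heads in 10 tosses is binomial, and the chance that the two coins give the same number of heads is the sum of the squared binomial probabilities.

```python
from math import comb

tosses = 10

# P(X = k) for X ~ Binomial(10, 0.5) is comb(10, k) / 2**10.
# Two independent coins match when both show k heads, for some k.
p_same = sum((comb(tosses, k) / 2**tosses) ** 2 for k in range(tosses + 1))
p_different = 1 - p_same

print(f"{p_same:.3f} {p_different:.3f}")
# prints: 0.176 0.824
```

So two identical coins produce different head counts more than 80% of the time. A difference between two groups is therefore not, by itself, evidence of a real effect. What matters is whether the difference is larger than chance variation can plausibly explain.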


Exercises: Section 1.1 (1-13) 


1. Give one possible sample of size 4 from each 5. The authors of the article “From Dark to 


of the following populations: 


a. All daily newspapers published in the 
USA 
b. All companies listed on the New York 
Stock Exchange 
. All students at your college or university 
d. All grade point averages of students at 
your college or university 


Q 


2. For each of the following hypothetical pop- 


ulations, give a plausible sample of size 4: 


a. All distances that might result when you 
throw a football 

b. Page lengths of books published 5 years 
from now 

c. All possible earthquake strength mea- 
surements (Richter scale) that might be 
recorded in California during the next 
year 

d. All possible yields Gin grams) from a 
certain chemical reaction carried out in a 
laboratory 


. Consider the population consisting of all cell 
phones of a certain brand and model, and 
focus on whether a cell phone needs service 
while under warranty. 


a. Pose several probability questions based 
on selecting a sample of 100 such cell 
phones. 

b. What inferential statistics question might 
be answered by determining the number 
of such cell phones in a sample of size 
100 that need warranty service? 


. Give three different examples of concrete 
populations and three different examples of 
hypothetical populations. For one each of 
your concrete and hypothetical populations, 
give an example of a probability question and 
an example of an_ inferential statistics 
question. 


7. 


Light: Skin Color and Wages among African 
Americans” (J. of Human Resources 2007: 
701-738) investigated the association 
between darkness of skin and hourly wages. 
For a sample of 948 African Americans, skin 
color was classified as dark black, medium 
black, light black, or white. 


a. What variables were recorded for each 
member of the sample? 

b. Classify each of these variables as quan- 
titative or categorical. 


. Consumer Reports compared the actual 


polyunsaturated fat percentages for different 
brands of “low-fat” margarine. Twenty-six 
containers of margarine were purchased; for 
each one, the brand was noted and the per- 
cent of polyunsaturated fat was determined. 


a. What variables were recorded for each 
margarine container in the sample? 

b. Classify each of these variables as quan- 
titative or categorical. 

c. Give some examples of inferential statis- 
tics questions that Consumer Reports 
might try to answer with the data from 
these 26 margarine containers. 

d. “The average polyunsaturated fat content 
for the five Parkay margarine containers 
in the sample was 12.8%.” Is the pre- 
ceding sentence an example of descriptive 
Statistics or inferential statistics? 


The article “Is There a Market for Functional 
Wines? Consumer Preferences and Willing- 
ness to Pay for Resveratrol-Enriched Red 
Wine” (Food Quality and Preference 2008: 
360-371) included the following information 
for a variety of Spanish wines: 

a. Region of origin 

b. Price of the wine, in euros 

c. Style of wine (young or crianza) 


d. Production method (conventional or 
organic) 

e. Type of grapes used (regular or 
resveratrol-enhanced) 

Classify each of these variables as quanti- 

tative or categorical. 


8. The authors of the article cited in the pre- 


vious exercise surveyed 300 wine con- 
sumers, each of whom tasted two different 
wines. For each individual in the study, the 
following information was recorded: 


Gender 

Age, in years 

Monthly income, in euros 

Educational level (primary, secondary, 

or university) 

e. Willingness to pay (WTP) for the first 
wine tasted, in euros 

f. WTP for the second wine tasted, in 
euros. 

(WTP is a very common measure for con- 

sumer products. Researchers ask, “How 

much would you be willing to pay for this 

item?”) Classify each of the variables (a)— 

(f) as quantitative or categorical. 


aos, 


. Many universities and colleges have insti- 
tuted supplemental instruction (SD programs, 
in which a student facilitator meets regularly 
with a small group of students enrolled in the 
course to promote discussion of course 
material and enhance subject mastery. Sup- 
pose that students in a large statistics course 
(what else?) are randomly divided into a 
control group that will not participate in SI 
and a treatment group that will participate. At 
the end of the term, each student’s total score 
in the course is determined. 


a. Are the scores from the SI group a 
sample from an existing population? If 
so, what is it? If not, what is the relevant 
conceptual population? 

b. What do you think is the advantage of 
randomly dividing the students into the 
two groups rather than letting each student 
choose which group to join? 

c. Why didn’t the investigators put all 
students in the treatment group? 


10. The California State University (CSU) system
consists of 23 campuses, from San
Diego State in the south to Humboldt State
near the Oregon border. A CSU administrator
wishes to make an inference about the
average distance between the hometowns
of students and their campuses. Describe
and discuss several different sampling
methods that might be employed.

11. A certain city divides naturally into ten
district neighborhoods. A real estate
appraiser would like to develop an equation
to predict appraised value from characteristics
such as age, size, number of bathrooms,
distance to the nearest school, and
so on. How might she select a sample of
single-family homes that could be used as a
basis for this analysis?

12. The amount of flow through a solenoid
valve in an automobile’s pollution control
system is an important characteristic. An
experiment was carried out to study how
flow rate depended on three factors: armature
length, spring load, and bobbin depth.
Two different levels (low and high) of each
factor were chosen, and a single observation
on flow was made for each combination
of levels.

a. The resulting data set consisted of how
many observations?

b. Does this study involve sampling an
existing population or a conceptual
population?

13. In a famous experiment carried out in 1882,
Michelson and Newcomb obtained 66
observations on the time it took for light to
travel between two locations in Washington,
D.C. A few of the measurements
(coded in a certain manner) were 31, 23,
32, 36, 22, 26, 27, and 31.

a. Why are these measurements not
identical?

b. Does this study involve sampling an
existing population or a conceptual
population?

There are two general types of methods within descriptive statistics: graphical and numerical summaries.
In this section we will discuss the first of these types: representing a data set using visual
techniques. In Sections 1.3 and 1.4, we will develop some numerical summary measures for data sets. 
Many visual techniques may already be familiar to you: frequency tables, histograms, pie charts, bar 
graphs, scatterplots, and the like. Here we focus on a selected few of these techniques that are most 
useful and relevant to probability and inferential statistics. 


Notation 

Some general notation will make subsequent discussions easier. The number of observations in a
single sample, that is, the sample size, will often be denoted by n. So n = 4 for the sample of
universities {Stanford, Iowa State, Wyoming, Rochester} and also for the sample of pH measurements
{6.3, 6.2, 5.9, 6.5}. If two samples are simultaneously under consideration, either m and n or n1
and n2 can be used to denote the numbers of observations. Thus if {3.75, 2.60, 3.20, 3.79} and {2.75,
1.20, 2.45} are GPAs for two sets of friends, respectively, then m = 4 and n = 3.

Given a data set consisting of n observations on some variable x, the individual observations will
be denoted by x1, x2, x3, ..., xn. The subscript bears no relation to the magnitude of a particular
observation. Thus x1 will not in general be the smallest observation in the set, nor will xn typically be
the largest. In many applications, x1 will be the first observation gathered by the experimenter, x2 the
second, and so on. The ith observation in the data set will be denoted by xi.


Stem-and-Leaf Displays 

Consider a numerical data set x1, x2, ..., xn for which each xi consists of at least two digits. A quick
way to obtain an informative visual representation of the data set is to construct a stem-and-leaf
display, or stem plot.


Steps for constructing a stem-and-leaf display 

1. Select one or more leading digits for the stem values. The trailing digits become the leaves. 
2. List possible stem values in a vertical column. 

3. Record the leaf for every observation beside the corresponding stem value. 

4. Order the leaves from smallest to largest on each line. 

5. Indicate the units for stems and leaves someplace in the display. 


If the data set consists of exam scores, each between 0 and 100, the score of 83 would have a stem 
of 8 and a leaf of 3. For a data set of automobile fuel efficiencies (mpg), all between 8.1 and 47.8, we 
could use the tens digit as the stem, so 32.6 would then have a leaf of 2.6. Usually, a display based on 
between 5 and 20 stems is appropriate. 

For a simple example, assume a sample of seven test scores: 93, 84, 86, 78, 95, 81, 72. Then the 
first-pass stem plot would be 


7|82
8|461
9|35

With the leaves ordered this becomes


7|28    Stem: tens digit
8|146   Leaf: ones digit
9|35
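
The construction steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the data are nonnegative integers, the stem is everything but the ones digit, and the function name `stem_and_leaf` is ours rather than anything from a statistical package.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the ones digit as leaf and the rest as stem."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)       # e.g. 83 -> stem 8, leaf 3
        rows[stem].append(leaf)
    lines = []
    # List every possible stem between the smallest and largest, even if empty
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        lines.append(f"{stem}|{leaves}")
    return lines

scores = [93, 84, 86, 78, 95, 81, 72]    # the seven test scores from the text
display = stem_and_leaf(scores)
print("\n".join(display))
```

Sorting the leaves inside the function combines the first-pass and ordered displays into a single step; the units legend would still have to be added by hand.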


Occasionally stems will be repeated to spread out the stem-and-leaf display. For instance, if the 
preceding test scores included dozens of values in the 70s, we could repeat the stem 7 twice, using 7L 
for scores in the low 70s (leaves 0, 1, 2, 3, 4) and 7H for scores in the high 70s (leaves 5, 6, 7, 8, 9). 


Example 1.4 Job prospects for students majoring in an engineering discipline continue to be very 
robust. How much can a new engineering graduate expect to earn? Here are the starting salaries for a 
sample of 38 civil engineers from one author’s home institution (Spring 2016), courtesy of the 
university’s Graduate Status Report: 


58,000 62,000 56,160 67,000 66,560 58,240 60,000 61,000 70,000 61,000 
65,000 60,000 61,000 80,000 62,500 75,000 60,000 68,000 57,600 65,000 
55,000 63,000 60,000 70,000 68,640 72,000 83,000 50,128 56,000 63,000 
55,000 52,000 70,000 80,000 60,320 65,000 70,000 65,000 


Figure 1.3 shows a stem-and-leaf display of these 38 starting salaries. Hundreds places and lower 
have been truncated; for instance, the lowest salary in the sample was $50,128, which is represented 
by 5|0 in the first row.


5L | 02
5H | 5566788        Stem: $10,000
6L | 000001112233   Leaf: $1000
6H | 55556788
7L | 00002
7H | 5
8L | 003


Figure 1.3 Stem-and-leaf display for starting salaries of civil engineering graduates 


Typical starting salaries were in the $60,000-$65,000 range, with most graduates starting between 
$55,000 and $70,000. A lucky (and/or exceptionally talented!) handful of students earned $80,000 or 
more upon graduation.
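
As a rough check on the display in Figure 1.3, the following Python sketch rebuilds it from the 38 salaries using repeated (L/H) stems and truncation to the thousands place. The function name `split_stem_display` and the `leaf_unit` parameter are illustrative choices of ours, not part of any package.

```python
salaries = [
    58000, 62000, 56160, 67000, 66560, 58240, 60000, 61000, 70000, 61000,
    65000, 60000, 61000, 80000, 62500, 75000, 60000, 68000, 57600, 65000,
    55000, 63000, 60000, 70000, 68640, 72000, 83000, 50128, 56000, 63000,
    55000, 52000, 70000, 80000, 60320, 65000, 70000, 65000,
]

def split_stem_display(values, leaf_unit=1000):
    """Stem-and-leaf lines with each stem split into L (leaves 0-4) and H (leaves 5-9)."""
    rows = {}
    for v in values:
        stem, leaf = divmod(v // leaf_unit, 10)   # digits below leaf_unit are truncated
        half = 0 if leaf <= 4 else 1              # 0 -> "L" row, 1 -> "H" row
        rows.setdefault(2 * stem + half, []).append(leaf)
    lines = []
    for key in range(min(rows), max(rows) + 1):   # every row from lowest to highest
        stem, half = divmod(key, 2)
        leaves = "".join(str(d) for d in sorted(rows.get(key, [])))
        lines.append(f"{stem}{'LH'[half]} | {leaves}")
    return lines

figure_rows = split_stem_display(salaries)
print("\n".join(figure_rows))
```

Integer division by `leaf_unit` is what implements the truncation described above, so $50,128 contributes the leaf 0 to the 5L row.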

Most graphical displays of quantitative data, including the stem-and-leaf display, convey infor- 
mation about the following aspects of the data: 


Identification of a typical or representative value 
Extent of spread about the typical value 
Presence of any gaps in the data 

Extent of symmetry in the distribution of values 
Number and location of peaks 


Presence of any outlying values (i.e., unusually small or large)
Example 1.5 Figure 1.4 presents stem-and-leaf displays for a random sample of lengths of golf
courses (yards) that have been designated by Golf Magazine as among the most challenging in the USA. 
Among the sample of 40 courses, the shortest is 6433 yards long, and the longest is 7280 yards. The 
lengths appear to be distributed in a roughly uniform fashion over the range of values in the sample. 
Notice that a stem choice here of either a single digit (6 or 7) or three digits (643, ..., 728) would yield 
an uninformative display, the first because of too few stems and the latter because of too many. 


(a)
64| 33 35 64 70              Stem: Thousands and hundreds digits
65| 06 26 27 83              Leaf: Tens and ones digits
66| 05 14 94
67| 00 13 45 70 70 90 98
68| 50 70 73 90
69| 00 04 27 36
70| 05 11 22 40 50 51
71| 05 13 31 65 68 69
72| 09 80

(b)
Stem-and-leaf of yardage  N = 40
Leaf Unit = 10
64 | 3367
65 | 0228
66 | 019
67 | 0147799
68 | 5779
69 | 0023
70 | 012455
71 | 013666
72 | 08


Figure 1.4 Stem-and-leaf displays of golf course yardages: (a) two-digit leaves; 
(b) display from Minitab with truncated one-digit leaves


Dotplots 

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are 
relatively few distinct data values. Each observation is represented by a dot above the corresponding 
location on a horizontal measurement scale. When a value occurs more than once, there is a dot for 
each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot 
gives information about location, spread, extremes, and gaps. 
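
A character-based dotplot is easy to sketch in Python. The version below assumes integer-valued data and uses the eight coded light-travel measurements quoted in Exercise 13; the function name `text_dotplot` is hypothetical, and a real plotting library would normally be used instead.

```python
from collections import Counter

def text_dotplot(values):
    """Return (lines, lo, hi): rows of stacked '*' over the integer scale lo..hi."""
    counts = Counter(values)
    lo, hi = min(counts), max(counts)
    height = max(counts.values())
    lines = []
    for level in range(height, 0, -1):            # top row of the stack first
        row = "".join(
            "*" if counts.get(x, 0) >= level else " " for x in range(lo, hi + 1)
        )
        lines.append(row.rstrip())
    return lines, lo, hi

measurements = [31, 23, 32, 36, 22, 26, 27, 31]
lines, lo, hi = text_dotplot(measurements)
print("\n".join(lines))
print(f"{lo}{' ' * (hi - lo - 3)}{hi}")            # crude axis labels at the two ends
```

The repeated value 31 produces a stack of two symbols, which is exactly how a dotplot shows ties.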


Example 1.6 For decades, The Economist has used its “Big Mac index,” defined for any country as 
the average cost of a McDonald’s Big Mac, as a humorous way to compare product costs across 
nations and also examine the valuation of the US dollar worldwide. Here are values of the Big Mac 
index, converted to US dollars, for 56 countries reported by The Economist on January 1, 2019 (listed 
in alphabetical order by country name): 


2.00 4.35 2.33 3.18 4.55 4.07 5.08 3.89 3.05 3.73
3.77 3.24 3.81 4.60 2.23 4.64 3.23 3.49 2.55 3.03
2.55 2.34 4.58 3.60 2.75 3.46 4.31 2.20 2.54 2.32
4.19 3.18 5.86 2.13 3.31 3.14 2.67 2.80 3.30 2.29
1.65 3.20 4.28 2.24 4.02 3.18 5.84 6.62 2.24 3.72
2.00 1.94 3.81 5.58 4.31 2.80


Figure 1.5 shows a dotplot of these values. We can see that the average cost of a Big Mac in the 
USA, $5.58, is higher than in all but three countries (Sweden, Norway, and Switzerland). A typical 
Big Mac index value is around $3.20, but those values vary substantially across the globe. The 
distribution extends farther to the right of that typical value than to the left, due to a handful of 
comparatively large Big Mac prices in the USA and a few other countries. 
[Dotplot: horizontal axis is Big Mac index (US$), marked from 2.0 to 6.5 in steps of 0.5; the U.S. value ($5.58) is labeled]


Figure 1.5 A dotplot of the data from Example 1.6 


How can the Big Mac index be used to assess the strength of the US dollar? As an example, a Big 
Mac cost £3.19 in Britain, which converts to $4.07 using the exchange rate at the time the data was 
collected. Since that same amount of British currency ought to buy $5.58 worth of American goods 
according to the Big Mac index, it appears that the British pound was substantially undervalued at the 
time.


Histograms 
While stem-and-leaf displays and dotplots are useful for smaller data sets, histograms are well-suited 
to larger samples or the results of a census. 

Consider first data resulting from observations on a “counting variable” x, such as the number of 
traffic citations a person received during the last year, or the number of people arriving for service 
during a particular period. The frequency of any particular x value is simply the number of times that 
value occurs in the data set. The relative frequency of a value is the fraction or proportion of times 
the value occurs: 


$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$$


Suppose, for example, that our data set consists of 200 observations on x = the number of major 
defects in a new car of a certain type. If 70 of these x values are 1, then the frequency of the value 1 is 
(obviously) 70, while the relative frequency of the value 1 is 70/200 = .35. Multiplying a relative 
frequency by 100 gives a percentage; in the defect example, 35% of the cars in the sample had just 
one major defect. The relative frequencies, or percentages, are usually of more interest than the 
frequencies themselves. In theory, the relative frequencies should sum to 1, but in practice the sum 
may differ slightly from 1 because of rounding. A frequency distribution is a tabulation of the 
frequencies and/or relative frequencies. 


Example 1.7 How unusual is a no-hitter or a one-hitter in a major league baseball game, and how 
frequently does a team get more than 10, 15, or 20 hits? Table 1.1 is a frequency distribution for the 
number of hits per team per game for all games in the 2016 regular season, courtesy of the website 
www.retrosheet.org. 
Table 1.1 Frequency distribution for hits per team in 2016 MLB games


Hits/Team/  Number of  Relative     Hits/Team/  Number of  Relative
Game        games      frequency    Game        games      frequency
 0            1        .0002        12            297      .0612
 1           19        .0039        13            202      .0416
 2           56        .0115        14            176      .0362
 3          151        .0311        15            102      .0210
 4          279        .0575        16             71      .0146
 5          380        .0783        17             56      .0115
 6          471        .0970        18             32      .0066
 7          554        .1141        19             22      .0045
 8          564        .1161        20              3      .0006
 9          556        .1145        21              4      .0008
10          469        .0966        22              5      .0010
11          386        .0795                    4,856      .9999


The corresponding histogram in Figure 1.6 rises rather smoothly to a single peak and then 
declines. The histogram extends a bit more to the right (toward large values) than it does on the left— 
a slight “positive skew.” 


[Histogram: vertical axis is relative frequency; horizontal axis is number of hits]


Figure 1.6 Relative frequency histogram of x = number of hits per team per game for the 2016 MLB season 


Either from the tabulated information or from the histogram, we can determine the following: 


$$\begin{aligned}
\text{proportion of instances of at most two hits}
&= (\text{relative frequency for } x = 0) + (\text{relative frequency for } x = 1) + (\text{relative frequency for } x = 2) \\
&= .0002 + .0039 + .0115 = .0156
\end{aligned}$$

Similarly,

$$\text{proportion of instances of between 5 and 10 hits (inclusive)} = .0783 + .0970 + \cdots + .0966 = .6166$$


That is, roughly 62% of the time that season, a team had between 5 and 10 hits (inclusive) in a game. 
Incidentally, the only no-hitter of the season (notice the frequency of 1 for x = 0) came on April 21, 
2016, with Jake Arrieta pitching the complete game for the Chicago Cubs against the Cincinnati Reds. 
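
The same proportions can be recomputed directly from the counts in Table 1.1. The short Python sketch below is only illustrative; the variable names are ours. It works with unrounded relative frequencies, so its results can differ from the hand calculations in the fourth decimal place.

```python
# Number of team-games with 0, 1, 2, ..., 22 hits (list index = number of hits)
games = [1, 19, 56, 151, 279, 380, 471, 554, 564, 556, 469, 386,
         297, 202, 176, 102, 71, 56, 32, 22, 3, 4, 5]

total = sum(games)                                  # number of observations
rel_freq = [count / total for count in games]       # relative frequency of each x

at_most_two = sum(rel_freq[:3])                     # x = 0, 1, 2
five_to_ten = sum(rel_freq[5:11])                   # x = 5, ..., 10 inclusive

print(f"n = {total}")
print(f"P(at most 2 hits)   = {at_most_two:.4f}")
print(f"P(5 to 10 hits)     = {five_to_ten:.4f}")
```

Because the relative frequencies here are not rounded before summing, they add to exactly 1 up to floating-point error, whereas the rounded column in the table adds to .9999.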


Constructing a histogram for measurement data (e.g., weights of individuals, reaction times to a 
particular stimulus) requires subdividing the measurement axis into a suitable number of class 
intervals or classes, such that each observation is contained in exactly one class. Suppose, for 
example, that we have 50 observations on x = fuel efficiency of an automobile (mpg), the smallest of 
which is 27.8 and the largest of which is 31.4. Then we could use the class boundaries 27.5, 28.0, 
28.5, ..., and 31.5 as shown here: 


27.5 28.0 28.5 29.0 29.5 30.0 30.5 31.0 31.5 


When all class widths are equal, a histogram is constructed as follows: first, mark the class 
boundaries on a horizontal axis like the one above. Then, above each interval, draw a rectangle whose 
height is the corresponding relative frequency (or frequency). 

One potential difficulty is that occasionally an observation falls on a class boundary and therefore 
does not lie in exactly one interval, for example, 29.0. We will use the convention that any obser- 
vation falling on a class boundary will be included in the class to the right of the observation. Thus 
29.0 would go in the 29.0-29.5 class rather than the 28.5-29.0 class. This is how Minitab constructs a
histogram; in contrast, the default histogram in R does it the other way, with 29.0 going into the
28.5-29.0 class.
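
The two boundary conventions can be made concrete with a small Python sketch built on the standard `bisect` module. The function `class_index` and its `right_closed` flag are hypothetical names of ours; the sketch does not try to reproduce how any particular package treats the lowest or highest boundary.

```python
from bisect import bisect_left, bisect_right

def class_index(x, boundaries, right_closed=False):
    """Index of the class containing x, for sorted class boundaries.

    right_closed=False: classes are [a, b), so a boundary value goes to the
                        class on its right (the convention used in this text).
    right_closed=True:  classes are (a, b], so a boundary value goes to the
                        class on its left.
    """
    if right_closed:
        return bisect_left(boundaries, x) - 1
    return bisect_right(boundaries, x) - 1

boundaries = [27.5 + 0.5 * k for k in range(9)]    # 27.5, 28.0, ..., 31.5

i_left_closed = class_index(29.0, boundaries)                       # class 29.0-29.5
i_right_closed = class_index(29.0, boundaries, right_closed=True)   # class 28.5-29.0
print(boundaries[i_left_closed], boundaries[i_right_closed])
```

Observations that are not on a boundary land in the same class under either convention, so the choice only matters for data recorded to the same precision as the boundaries.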


Example 1.8 Power companies need information about customer usage to obtain accurate forecasts 
of demands. Investigators from Wisconsin Power and Light determined energy consumption (BTUs) 
during a particular period for a sample of 90 gas-heated homes. For each home, an adjusted con- 
sumption value was calculated to account for weather and house size. This resulted in the accom- 
panying data (part of the stored data set furnace.mtw available in Minitab), which we have ordered 
from smallest to largest. 


2.97 4.00 5.20 5.56 5.94 5.98 6.35 6.62 6.72 6.78 
6.80 6.85 6.94 7.15 7.16 7.23 7.29 7.62 7.62 7.69 
7.73 7.87 7.93 8.00 8.26 8.29 8.37 8.47 8.54 8.58
8.61 8.67 8.69 8.81 9.07 9.27 9.37 9.43 9.52 9.58 
9.60 9.76 9.82 9.83 9.83 9.84 9.96 10.04 10.21 10.28 
10.28 10.30 10.35 10.36 10.40 10.49 10.50 10.64 10.95 11.09 
11.12 11.21 11.29 11.43 11.62 11.70 11.70 12.16 12.19 12.28 
12.31 12.62 12.69 12.71 12.91 12.92 13.11 13.38 13.42 13.43 


13.47 13.60 13.96 14.24 14.35 15.12 15.24 16.06 16.90 18.26 
We let Minitab select the class intervals. The most striking feature of the histogram in Figure 1.7 is
its resemblance to a bell-shaped (and therefore symmetric) curve, with the point of symmetry roughly
at 10.


[Histogram: vertical axis is percent (ticks at 10, 20, 30); horizontal axis runs from 1 to 19 in steps of 2]


Figure 1.7 Minitab histogram of the energy consumption data from Example 1.8 


Class               1-<3   3-<5   5-<7   7-<9   9-<11   11-<13   13-<15   15-<17   17-<19
Frequency           1      1      11     21     25      17       9        4        1
Relative frequency  .011   .011   .122   .233   .278    .189     .100     .044     .011


From the histogram, 


$$\text{proportion of observations less than 9} \approx .01 + .01 + .12 + .23 = .37 \quad (\text{exact value} = 34/90 = .378)$$


The relative frequency for the 9-11 class is about .27, so we estimate that roughly half of this, or .135, 
is between 9 and 10. Thus 


$$\text{proportion of observations less than 10} \approx .37 + .135 = .505 \quad (\text{slightly more than 50\%})$$


The exact value of this proportion is 47/90 = .522.
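
The "take half of the 9-11 class" reasoning is linear interpolation within a class, and it can be written out as a short Python sketch. The function name `proportion_below` is ours. The sketch uses the exact class frequencies from the table rather than values read off the graph, so its estimate for 10 comes out near .517 instead of .505.

```python
boundaries = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]    # class boundaries from the table
freq = [1, 1, 11, 21, 25, 17, 9, 4, 1]              # class frequencies
n = sum(freq)                                       # 90 homes

def proportion_below(cut):
    """Estimate the proportion of observations below `cut`, assuming the
    observations in each class are spread evenly across that class."""
    total = 0.0
    for lo, hi, f in zip(boundaries, boundaries[1:], freq):
        if cut >= hi:
            total += f                              # whole class lies below the cut
        elif cut > lo:
            total += f * (cut - lo) / (hi - lo)     # only part of the class is below
    return total / n

print(round(proportion_below(9), 3))    # cut on a boundary: no interpolation needed
print(round(proportion_below(10), 3))   # cut in the middle of the 9-<11 class
```

When the cut falls on a class boundary the result is exact; otherwise it is only as good as the even-spread assumption, which is why the estimate for 10 misses the true value 47/90.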


There are no hard-and-fast rules concerning either the number of classes or the choice of classes 
themselves. Between 5 and 20 classes will be satisfactory for most data sets. Generally, the larger the 
number of observations in a data set, the more classes should be used. A reasonable rule is 


$$\text{number of classes} \approx \sqrt{\text{number of observations}}$$
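
As a small illustration, the square-root rule can be combined with the 5-to-20 guideline in Python. Clamping the result to that range is our own assumption about how to merge the two pieces of advice, and `suggested_class_count` is a hypothetical helper name.

```python
import math

def suggested_class_count(n_obs, lo=5, hi=20):
    """Square-root rule for the number of classes, kept within [lo, hi]."""
    k = round(math.sqrt(n_obs))
    return max(lo, min(hi, k))

for n_obs in (10, 90, 369, 4856):
    print(n_obs, suggested_class_count(n_obs))
```

For the 90 homes of Example 1.8 the rule suggests 9 classes, which matches the number of classes in the histogram Minitab produced.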


Equal-width classes may not be a sensible choice if a data set “stretches out” to one side or the other.
Figure 1.8 shows a dotplot of such a data set. If a large number of short, equal-width classes are used,
many classes will have zero frequency. Using a small number of wide equal-width classes results in
almost all observations falling in just one or two of the classes. A sound choice is to use a few wider
intervals near extreme observations and narrower intervals in the region of high concentration. In such
situations a density histogram must be used.


Figure 1.8 Selecting class intervals for “stretched out” dots: (a) many short equal width intervals; 
(b) a few wide equal width intervals; (c) unequal width intervals 


DEFINITION   For any class to be used in a histogram, the density of the data in that class
is defined by

$$\text{density} = \frac{\text{relative frequency of the class}}{\text{class width}}$$


A histogram can then be constructed in which the height of the rectangle over each class is its density. 
The vertical scale on such a histogram is called a density scale. 

When class widths are unequal, not using a density scale will give a picture with distorted areas. 
For equal class widths, the divisor is the same in each density calculation, and the extra arithmetic 
simply results in a rescaling of the vertical axis (i.e., the histogram using relative frequency and the 
one using density will have exactly the same appearance). 

A density histogram does have one interesting property. Multiplying both sides of the formula for 
density by the class width gives 


relative frequency = (class width) (density) = (rectangle width) (rectangle height) 


= rectangle area 


That is, the area of each rectangle is the relative frequency of the corresponding class. Furthermore, 
because the sum of relative frequencies must be 1.0 (except for roundoff), the total area of all 
rectangles in a density histogram is 1. It is always possible to draw a histogram so that the area equals 
the relative frequency (this is true also for a histogram of counting data)—just use the density scale. 
This property will play an important role in creating models for certain distributions in Chapter 4. 


Example 1.9 The Environmental Protection Agency (EPA) publishes information each year on the
estimated gas mileage as well as expected annual fuel cost for hundreds of new vehicles. For 2018, the
EPA evaluated 369 different cars and trucks. The fuel cost estimates ranged from $700 (Toyota Camry
Hybrid LE) to $3800 (Bugatti Chiron). We have divided the expected annual fuel costs of these
vehicles, in hundreds of dollars, into six intervals: 7-<10, 10-<15, 15-<17.5, 17.5-<20, 20-<25, and 25-38.
Class               7-<10   10-<15   15-<17.5   17.5-<20   20-<25   25-38
Frequency           3       86       103        83         74       20
Relative frequency  .0081   .2331    .2791      .2249      .2005    .0542
Density             .0027   .0466    .1117      .0900      .0401    .0042


The resulting histogram appears in Figure 1.9. The right or upper tail stretches out much farther 
than does the left or lower tail—a substantial departure from symmetry. Thankfully, high- 
efficiency/low-cost vehicles predominate, and gas guzzlers are relatively rare. 
Figure 1.9 R density histogram for the fuel cost data of Example 1.9
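
The density calculation for unequal class widths, and the fact that the rectangle areas add to 1, can be checked with a few lines of Python. This is a sketch using the class boundaries and frequencies tabulated in Example 1.9; the variable names are ours.

```python
boundaries = [7, 10, 15, 17.5, 20, 25, 38]     # fuel cost, in hundreds of dollars
freq = [3, 86, 103, 83, 74, 20]                # vehicles per class

n = sum(freq)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
rel_freq = [f / n for f in freq]
density = [r / w for r, w in zip(rel_freq, widths)]   # rectangle heights

# Area of each rectangle = width * height = relative frequency of the class
total_area = sum(d * w for d, w in zip(density, widths))

for lo, hi, d in zip(boundaries, boundaries[1:], density):
    print(f"{lo:>5} to {hi:<5} density = {d:.4f}")
print(f"total area = {total_area:.4f}")
```

Note how the 25-38 class, which is 13 units wide, ends up with a very short rectangle even though it contains more vehicles than the 7-<10 class; plotting raw relative frequencies would exaggerate it.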


Histogram Shapes 

Histograms come in a variety of shapes. A unimodal histogram is one that rises to a single peak and 
then declines. A bimodal histogram has two different peaks. Bimodality can occur when the data set 
consists of observations on two quite different kinds of individuals or objects. For example, the 
histogram of a data set consisting of driving times between San Luis Obispo and Monterey in 
California would show two peaks, one for those cars that took the inland route (roughly 2.5 h) and 
another for those cars traveling up the coast (3.5-4 h). A histogram with more than two peaks is said
to be multimodal. 

A histogram is symmetric if the left half is a mirror image of the right half. A unimodal histogram 
is positively skewed if the right or upper tail is stretched out compared with the left or lower tail and 
negatively skewed if the stretching is to the left. Figure 1.10 shows “smoothed” histograms, obtained 
by superimposing a smooth curve on the rectangles, that illustrate various possibilities. 


Figure 1.10 Smoothed histograms: (a) symmetric unimodal; (b) bimodal; (c) positively skewed; 
and (d) negatively skewed 
Categorical Data

Both a frequency distribution and a pie chart or bar graph can be constructed when a data set is 
categorical in nature; generally speaking, statisticians prefer bar graphs over pie charts in most 
circumstances. Sometimes there will be a natural ordering of categories (freshman, sophomore, 
junior, senior, graduate student); for such ordinal data the categories should be presented in their 
natural order. In other cases the order will be arbitrary (e.g., Catholic, Jewish, Protestant, and so on); 
while we have the choice of displaying nominal data in any order, it’s common to sort the categories 
in decreasing order of their (relative) frequencies. Either way, the rectangles for the bar graph should 
have equal width. 


Example 1.10 Each member of a sample of 120 individuals owning motorcycles was asked for the
name of the manufacturer of his or her bike. The frequency distribution for the resulting data is given
in Table 1.2 and the bar chart is shown in Figure 1.11.


Table 1.2 Frequency distribution for motorcycle data 


Manufacturer          Frequency   Relative frequency
1. Honda                  41            .34
2. Yamaha                 27            .23
3. Kawasaki               20            .17
4. Harley-Davidson        18            .15
5. BMW                     3            .03
6. Other                  11            .09
                         120           1.01

[Bar chart: vertical axis is relative frequency; bars for Honda, Yamaha, Kawasaki, Harley-Davidson, BMW, and Other]


Figure 1.11 Bar chart for motorcycle data
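
A frequency distribution and a rough text bar chart for nominal data can be sketched in Python as follows. The counts are those of Table 1.2; keeping the catch-all "Other" category last while sorting the rest by decreasing frequency is our reading of how the table is laid out, and the bar scaling is arbitrary.

```python
counts = {
    "Honda": 41, "Yamaha": 27, "Kawasaki": 20,
    "Harley-Davidson": 18, "BMW": 3, "Other": 11,
}

n = sum(counts.values())
rel_freq = {name: c / n for name, c in counts.items()}

# Decreasing frequency, with the "Other" category forced to the end
ordered = sorted(counts, key=lambda name: (name == "Other", -counts[name]))

for name in ordered:
    bar = "#" * round(rel_freq[name] * 50)      # equal-width bars, length ~ rel. freq.
    print(f"{name:<16}{bar} {rel_freq[name]:.3f}")
```

The unrounded relative frequencies sum to 1; the total of 1.01 in Table 1.2 arises only because each entry was rounded to two decimal places before adding.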
Multivariate Data

The techniques presented so far have been exclusively for situations in which each observation in a 
data set is either a single number or a single category. Often, however, our data is multivariate in 
nature. That is, if we obtain a sample of individuals or objects and on each one we make two or more 
measurements, then each “observation” would consist of several measurements on one individual or 
object. The sample is bivariate if each observation consists of two measurements or responses, so that 
the data set can be represented as (x1, y1), ..., (xn, yn). For example, x might refer to engine size and
y to horsepower, or x might refer to brand of cell phone owned and y to academic major. We consider 
the analysis of multivariate data in several later chapters. 


Exercises: Section 1.2 (14-39)


14. Consider the fire resistance time data given
in Example 1.2.

a. Construct a stem-and-leaf display of the
data. What appears to be a representative
value? Do the observations appear
to be highly concentrated about the
representative value or rather spread
out?

b. Does the display appear to be reasonably
symmetric about a representative
value, or would you describe its shape
in some other way?

c. Do there appear to be any outlying fire
resistance times?

d. What proportion of times in this sample
exceed 45 min?

15. Construct a stem-and-leaf display for the
given batch of exam scores, repeating each
stem twice (so, 6L through 9H). What
feature of the data is highlighted by this
display?

74 89 80 93 64 67 72 70 66 85
89 81 81 71 74 82 85 63 72 81

16. A sample of 77 individuals working at a
particular office was selected and the noise
level (dBA) experienced by each one was
determined, yielding the following data
(“Acceptable Noise Levels for Construction
Site Offices,” Build. Serv. Engr. Res.
Technol. 2009: 87-94).

55.3 55.3 55.3 55.9 55.9 55.9 55.9 56.1 56.1 56.1 56.1
56.1 56.1 56.8 56.8 57.0 57.0 57.0 57.8 57.8 57.8 57.9
57.9 57.9 58.8 58.8 58.8 59.8 59.8 59.8 62.2 62.2 63.8
63.8 63.8 63.9 63.9 63.9 64.7 64.7 64.7 65.1 65.1 65.1
65.3 65.3 65.3 65.3 67.4 67.4 67.4 67.4 68.7 68.7 68.7
68.7 69.9 70.4 70.4 71.2 71.2 71.2 73.0 73.0 73.1 73.1
74.6 74.6 74.6 74.6 79.3 79.3 79.3 79.3 83.0 83.0 83.0

Construct a stem-and-leaf display using
repeated stems, and comment on any interesting
features of the display.

17. The following data on crack depth (μm) was
read from a graph in the article “Effects of
Electropolishing on Corrosion and Stress
Corrosion Cracking of Alloy 182 in High
Temperature Water” (Corrosion Sci. 2017: 1-10).

 1.5  2.9  3.1  3.3  3.4  3.6  3.7  3.8  3.9  4.1
 4.3  4.5  4.6  4.7  4.8  5.2  5.3  5.5  5.6  5.9
 6.1  6.9  7.2  7.5  8.0  8.0  8.1  8.2  8.5  9.2
 9.5  9.9 10.0 10.5 10.5 10.7 10.9 10.9 11.2 11.3
11.3 11.8 12.0 12.7 14.4 15.7 17.3 18.4 19.9 20.0
21.7 21.8 22.4 26.4 33.7 33.8 34.0 37.8 42.2 46.0
48.6 50.2 51.4 52.4 66.5 76.1 81.1

a. Construct a stem-and-leaf display of the
data.

b. What is a typical, or representative, crack
depth?

c. Does the display appear to be highly
concentrated or spread out?

d. Does the distribution of values appear to
be reasonably symmetric? If not, how
would you describe the departure from
symmetry?

e. Would you describe any observations as
being far from the rest of the data (outliers)?
18. The 15th Annual Demographia International
Housing Affordability Survey: 2019
reports the “median multiple,” the ratio of
median home price to median household
income, for 188 metropolitan areas in the
United States. (A higher median multiple
means that it’s harder for residents of that
area to purchase a home.) The resulting
data appears below.

3.0 4.1 3.1 3.0 3.8 3.9 4.8 3.5 2.8
4.1 4.3 3.6 3.4 3.1 3.5 5.1 5.3 6.7
4.9 3.0 2.8 2.6 4.3 2.5 4.5 3.8 3.4
2.8 2.9 2.8 4.5 4.5 3.2 3.2 3.1 3.7
2.3 2.7 4.3 5.5 2.8 3.0 2.8 4.3 3.5
5.6 2.8 3.2 3.0 2.7 5.2 2.9 4.4 2.6
4.4 3.1 4.7 2.8 3.2 4.1 3.1 3.1 2.5
2.8 8.6 3.7 2.9 3.0 2.9 3.1 3.9 2.8
4.0 2.9 3.2 3.4 3.4 3.8 3.2 2.4 3.4
3.0 3.0 2.8 9.2 3.1 2.8 3.2 3.7 3.4
4.0 3.4 6.0 5.7 3.8 3.4 3.0 5.0 2.8
5.3 3.9 3.4 3.1 4.0 5.5 3.4 3.7 2.7
2.7 4.5 7.1 3.6 2.3 3.4 4.3 2.6 4.4
5.2 4.3 4.3 3.8 2.7 5.8 3.7 5.6 3.2
2.2 5.6 5.0 7.5 3.9 4.4 3.9 7.8 8.8
8.1 7.5 9.6 7.5 4.6 3.2 2.5 5.6 4.1
2.6 3.2 4.3 3.8 3.0 2.8 3.9 2.3 3.7
2.5 3.2 4.0 3.1 2.2 5.4 3.5 4.6 3.3
2.8 5.0 3.3 3.8 4.3 2.8 2.2

a. Construct a stem-and-leaf display of the
data.

b. What is a typical, or representative,
median multiple?

c. Describe the shape of the distribution.

d. Values above 5.0 earn the city a
“Severely Unaffordable” rating. What
proportion of cities in the study have
severely unaffordable housing?

19. Do running times of American movies
differ somehow from times of French
movies? The authors investigated this
question by randomly selecting 25 recent
movies of each type, resulting in the following
running times:

American:   94   90   95   93  128   95  125
            91  104  116  162  102   90  110
            92  113  116   90   97  103   95
           120  109   91  138

French:    123  116   90  158  122  119  125
            90   96   94  137  102  105  106
            95  125  122  103   96  111   81
           113  128   93   92

Construct a comparative stem-and-leaf
display by listing stems in the middle of
your paper and then placing the American
leaves out to the left and the French leaves
out to the right. Then comment on interesting
features of the display.

20. The report “Congestion Reduction Strategies”
(Texas Transportation Institute, 2005)
investigated how much additional time (in
hours, per year per traveler) drivers spend
in traffic during peak hours for a sample of
urban areas. Data on “large” (e.g., Denver)
and “very large” (e.g., Houston) urban
areas appear below.

Large:       55 55 53 52 51 50 46
             46 43 40 39 38 35 33
             33 30 30 29 26 23 18
             17 14 13 12 10

Very Large:  93 72 69 67 63 60 58
             57 51 51 49 49 38

Construct a comparative stem-and-leaf
display (see Exercise 19) of this data.
Compare and contrast the extra time spent
in traffic for drivers in large urban areas
and very large urban areas.

21. Temperature transducers of a certain type
are shipped in batches of fifty. A sample of 
60 batches was selected, and the number of 
transducers in each batch not conforming to 
design specifications was determined, 
resulting in the following data: 
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a. Determine frequencies and relative fre- 
quencies for the observed values of 
x = number of nonconforming trans- 
ducers in a batch. 

b. What proportion of batches in the sam- 
ple have at most five nonconforming 
transducers? What proportion have 
fewer than five? What proportion have 
at least five nonconforming units? 
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22. 


23. 


c. Draw a histogram of the data using 
relative frequency on the vertical scale, 
and comment on its features. 


Lotka’s law is used in library science to 
describe the productivity of authors in a given 
field. The article “Lotka’s Law and Produc- 
tivity Patterns of Authors in Biomedical Sci- 
ence in Nigeria on HIV/AIDS: a Bibliometric 
Approach” (The Electronic Library 2016: 
789-807) provides the following frequency 
distribution for the number of articles written 
by various authors on HIV/AIDS over a five- 
year period in Nigeria: 


Number of papers 1 2 3 4 5 6 
Frequency 650 90 73 40 35 30 
Number of papers 7 8 9 10 11 
Frequency 23 17 15 10 5 


a. Construct a histogram corresponding to 
this frequency distribution. What is the 
most interesting feature of the shape of 
the distribution? 

b. What proportion of these authors pub- 
lished at least five papers? More than 
five papers? 

c. Suppose the ten 10s and five 11s had 
been lumped into a single category 
displayed as “10+.” Would you be able 
to draw a histogram? Explain. 

d. Suppose that instead of the values 10 
and 11 being listed separately, they had 
been combined into a 10-11 category 
with frequency 15. Would you be able 
to draw a histogram? Explain. 


23. The article “Ecological Determinants of
Herd Size in the Thorncraft’s Giraffe of 
Zambia” (Afric. J. Ecol. 2010: 962-971) 
gave the following data (read from a graph) 
on herd size for a sample of 1570 herds 
over a 34-year period. 


Herd size 1 2 3 4 5 6 7 8 
Frequency 
Herd size 9 10 11 12 13 14 15 17
Frequency 33 31 22 10 4 10 11 5
Herd size 18 19 20 22 23 24 26 32
Frequency 2 4 2 2 2 2 1 1


589 176 
a. What proportion of the sampled herds
had just one giraffe?

b. What proportion of the sampled herds
had six or more giraffes (characterized in
the article as “large herds”)?

c. What proportion of the sampled herds
had between five and ten giraffes,
inclusive?

d. Draw a histogram using relative frequency
on the vertical axis. How would
you describe the shape of this histogram?


24. The article “Determination of Most Representative
Subdivision” (J. Energy Engr.
1993: 43-55) gave data on various characteristics
of subdivisions that could be
used in deciding whether to provide electrical
power using overhead lines or
underground lines. Here are the values of
the variable x = total length of streets
within a subdivision:


1280 5320 4390 2100 1240 3060 4770
1050  360 3330 3380  340 1000  960
1320  530 3350  540 3870 1250 2400
 960 1120 2120  450 2250 2320 2400
3150 5700 5220  500 1850 2460 5850
2700 2730 1670  100 5770 3150 1890
 510  240  396 1419 2109


a. Construct a stem-and-leaf display using
the thousands digit as the stem and the
hundreds digit as the leaf, and comment
on various features of the display.

b. Construct a histogram using class
boundaries 0, 1000, 2000, 3000, 4000,
5000, and 6000. What proportion of
subdivisions have total length less than
2000? Between 2000 and 4000? How
would you describe the shape of the
histogram?


25. The article cited in the previous exercise
also gave the following values of the variables
y = number of culs-de-sac and
z = number of intersections:

a. Construct a histogram for the y data.
What proportion of these subdivisions 
had no culs-de-sac? At least one cul-de- 
sac? 

b. Construct a histogram for the z data. 
What proportion of these subdivisions 
had at most five intersections? Fewer 
than five intersections? 


26. How does the speed of a runner vary over
the course of a marathon (a distance of 
42.195 km)? Consider determining both the 
time to run the first 5 km and the time to 
run between the 35 km and 40 km points, 
and then subtracting the former time from 
the latter time. A positive value of this 
difference corresponds to a runner slowing 
down toward the end of the race. The 
accompanying histogram is based on times 
of runners who participated in several dif- 
ferent Japanese marathons (“Factors 
Affecting Runners’ Marathon  Perfor- 
mance,” Chance, Fall 1993: 24-30). 

What are some interesting features of this 
histogram? What is a typical difference 
value? Roughly what proportion of the 
runners ran the late distance more quickly 
than the early distance? (Time differences 
are in seconds.) 


[Histogram of time differences (seconds); vertical axis: Frequency]


27. America used to be number 1 in the world
for percentage of adults with four-year
degrees, but it has recently dropped to
19th. Here is data on the percentage of
adults age 25 or older in each state who had
a four-year degree as of 2015 (listed in
alphabetical order, with the District of
Columbia included):


23.5 28.0 27.5 21.1 29.5 38.1 37.6 30.0 54.6
27.3 28.8 30.8 25.9 32.3 24.1 26.7 31.0 22.3
22.5 29.0 37.9 40.5 26.9 33.7 20.7 27.1 31.4
29.3 23.0 34.9 36.8 26.3 34.2 28.4 27.7 26.1
24.1 30.8 28.6 31.9 25.8 27.0 24.9 27.6 31.1
36.0 36.3 32.9 19.2 27.8 25.7


a. Construct a dotplot, and comment on
any interesting features. [Note: The
values 54.6, 40.5, and 19.2 belong to
DC, MA, and WV, respectively.]

b. The national percentage of adults age 25
or older with a four-year degree was
29.8% in 2015. Would you obtain that
same value by averaging the 51 numbers
provided? Why or why not?


28. Tire pressure monitoring systems are
increasingly common for vehicles, but such
systems rarely include checking the spare
tire. The article “A Statistical Study of Tire
Pressures on Road-Going Vehicles” (SAE
Intl. J. Passeng. Cars—Mech. Systems
2016) provided the following data on the
amount (psi) by which spare tires in a
sample of 95 cars were under-inflated,
relative to manufacturer’s specifications:


25 60 57 7 35 20 23 > «52 
20 60 40 52 9 7 57 57 55 
46 -6 19 32 -Il1 17. 50 1 57 

5 4 60 50 41 34 6 54 31 

0 29 19 12 50 52 6 3 41 
50 32 12 38 32 46 Sl 26 20 
20 30 8 0 42 16 41 35 45 
39 25 42 29 3. 60 20 1 0) 
30. 130 «(37 13 1606 15 0 2500 «(24025 

-12 10. «10 5 


a. What does a value of 0 represent here?
What does a negative value represent?

b. Construct a relative frequency histogram
based on the class boundaries -20,
-10, 0, 10, ..., 50, and 60, and comment
on features of the distribution.

c. Construct a histogram based on the
following class boundaries: -20, 0, 10,
20, 30, 40, and 60.

d. What proportion of spare tires in the
sample were within ±10 psi of their
manufacturer-recommended pressure?


29. A transformation of data values by means
of some mathematical function, such as \(\sqrt{x}\)
or \(1/x\), can often yield a set of numbers that
has “nicer” statistical properties than the 
original data. In particular, it may be pos- 
sible to find a function for which the his- 
togram of transformed values is more 
symmetric (or, even better, more like a bell- 
shaped curve) than the original data. As an 
example, the article “The Negative Bino- 
mial Distribution as a Model for External 
Corrosion Defect Counts in Buried Pipeli- 
nes” (Corrosion Sci. 2015: 114-131) 
reported the number of defects in 50 oil and 
gas pipeline segments in southern Mexico. 


518 274 37 46 85 365 40 378 18
43 153 23 206 34 25 37 125 684
170 63 49 88 54 144 45 27 14
148 321 183 148 61 65 127 116 35
46 81 156 59 26 88 33 104 44


a. Use class intervals 0-<50, 50-<100, ...
to construct a histogram of the original
data.

b. Transform the data by applying \(\log_{10}()\) to
all 50 values. Use class intervals 1.0-<1.2,
1.2-<1.4, ..., 2.6-<2.8 to construct
a histogram for the transformed data.
What is the effect of the transformation?


30. Unlike most packaged food products,
alcohol beverage container labels are not
required to show calorie or nutrient content.
The article “What Am I Drinking? The
Effects of Serving Facts Information on
Alcohol Beverage Containers” (J. of
Consumer Affairs 2008: 81-99) reported on
a pilot study in which each individual in a
sample was asked to estimate the calorie
content of a 12 oz can of light beer known
to contain 103 cal. The following information
appeared in the article:


Class       Percentage
0-<50            7
50-<75           9
75-<100         23
100-<125        31
125-<150        12
150-<200         3
200-<300        12
300-<500         3


a. Construct a histogram of the data and 
comment on any interesting features. 

b. What proportion of the estimates were 
at least 100? Less than 200? 


31. The report “Majoring in Money 2019”
(Sallie Mae) provides the following relative 
frequency distribution for the credit card 
balance of a nationally representative sam- 
ple of n = 464 college students: 


$0 1% 
$1-$100 15% 
$101-$500 35% 
$501-$1000 14% 
$1001-$2500 18% 
$2501-$5000 6% 
>$5000 5% 


a. Approximately how many students in 
the survey reported a $0 credit card 
balance? 

b. What proportion of students surveyed 
carry a balance greater than $1000? 

c. Based on the information provided, is it 
possible to construct a histogram of the 
data? Why or why not? 


32. The College Board reports the following
Total SAT score distribution for 2018, the
first year of the “new” SAT format:


Score range Frequency 
1400-1600 145,023 
1200-1390 434,200 
1000-1190 741,452 
800-990 619,145 
600-790 192,267 
400-590 4452 


a. Create a histogram of this data. Com- 
ment on its features. 

b. What is a typical, or representative, 
Total SAT score? 

c. What proportion of students in 2018 
scored between 800 and 1190? 


33. The article “Study on the Life Distribution
of Microdrills” (J. Engr. Manuf. 2002:
301-305) reported the following observa- 
tions, listed in increasing order, on drill 
lifetime (number of holes that a drill 
machines before it breaks) when holes were 
drilled in a certain brass alloy. 


144 20 23 31 36 39 44 47 50 
61 65 67 68 Tl 74 +7 78 79 
89 91 93 96 99 101 104
105 112 118 123 136 139 141 148 158 
206 248 263 289 322 388 513 


a. Construct a frequency distribution and 
histogram of the data using class 
boundaries 0, 50, 100, ..., and then 
comment on interesting characteristics. 

b. Construct a frequency distribution and 
histogram of the natural logarithms of 
the lifetime observations, and comment 
on interesting characteristics. 

c. What proportion of the lifetime observa- 
tions in this sample are less than 100? 
What proportion of the observations are 
at least 200? 


34. Consider the following data on type of
health complaint (J = joint swelling,
F = fatigue, B = back pain, M = muscle
weakness, C = coughing, N = nose
running/irritation, O = other) made by tree
planters. Obtain frequencies and relative
frequencies for various categories, and
construct a bar chart. (The data is consistent
with percentages given in the article
“Physiological Effects of Work Stress and
Pesticide Exposure in Tree Planting by
British Columbia Silviculture Workers,”
Ergonomics 1993: 951-961.)
35. The report “Motorcycle Helmet Use in


2005—Overall Results” (NHTSA August 
2005) included observations made on the 
helmet use of 1700 motorcyclists across the 
country. The data is summarized in the 
accompanying table. (A “noncompliant 
helmet” failed to meet U.S. Department of 
Transportation safety guidelines.) 


Category Frequency 
No helmet 731 
Noncompliant helmet 153 
Compliant helmet 816 
Total 1700 


a. What is the variable in this example? What
are the possible values of that variable? 

b. Construct the relative frequency distri- 
bution of this variable. 

c. What proportion of observed motorcy- 
clists wore some kind of helmet? 

d. Construct an appropriate graph for this 
data. 


36. The author of the article “Food and Eating 


on Television: Impacts and Influences” 
(Nutr. and Food Sci. 2000: 24-29) exam- 
ined hundreds of hours of BBC television 
footage and categorized food images for 
both TV programs and commercials. The 
data presented here is consistent with 
information in the article; one of the 
research goals was to compare food images 
in ads and programs. 
Number of food images
Food category                    TV Programs   Commercials
Milk and dairy products              149            99
Breads, cereals, and potatoes        372           346
...
Fruits and vegetables                694            32
Fatty and sugary foods               322           511
Total                                ...       n = 1186


a. Why is it inappropriate to compare
frequencies (counts) between program
images and commercial images?

b. Obtain the relative frequency distribution
for the variable food category
among images in TV programs. Create
a graph of the distribution.

c. Repeat part (b) for food images in
commercials.

d. Contrast the two distributions: what are
the biggest differences between the
types of food images in TV programs
and those in commercials?


37. A Pareto diagram is a variation of a bar
chart for categorical data resulting from a
quality control study. Each category represents
a different type of product nonconformity
or production problem. The
categories are ordered so that the one with
the largest frequency appears on the far left,
then the category with the second-largest
frequency, and so on. Suppose the following
information on nonconformities in circuit
packs is obtained: failed component,
126; incorrect component, 210; insufficient
solder, 67; excess solder, 54; missing component,
131. Construct a Pareto diagram.


38. The cumulative frequency and cumulative
relative frequency for a particular
class interval are the sum of frequencies and
relative frequencies, respectively, for that
interval and all intervals lying below it. If,
for example, there are four intervals with
frequencies 9, 16, 13, and 12, then the
cumulative frequencies are 9, 25, 38, and
50, and the cumulative relative frequencies
are .18, .50, .76, and 1.00. Compute the
cumulative frequencies and cumulative relative
frequencies for the data of Exercise 28,
using the class intervals in part (c).


39. Fire load (MJ/m²) is the heat energy that
could be released per square meter of floor
area by combustion of contents and the
structure itself. The article “Fire Loads in
Office Buildings” (J. Struct. Engr. 1997:
365-368) gave the following cumulative
percentages (read from a graph) for fire
loads in a sample of 388 rooms:


Value            0    150    300    450    600
Cumulative %     0   19.3   37.6   62.7   77.5
Value          750    900   1050   1200   1350
Cumulative %  87.2   93.8   95.7   98.6   99.1
Value         1500   1650   1800   1950
Cumulative %  99.5   99.6   99.8  100.0


a. Construct a relative frequency histogram
and comment on interesting features.

b. What proportion of fire loads are less
than 600? At least 1200?

c. What proportion of the loads are
between 600 and 1200?


1.3 Measures of Center


Visual summaries of data are excellent tools for obtaining preliminary impressions and insights. More 
formal data analysis often requires the calculation and interpretation of numerical summary measures 
—numbers that might serve to characterize the data set and convey some of its most important 
features. 

Our primary concern will be with quantitative data. Suppose that our data set is of the form
\(x_1, x_2, \ldots, x_n\), where each \(x_i\) is a number. What features of such a set of numbers are of most
interest and deserve emphasis? One important characteristic of a set of numbers is its “center”: a
single value that we might consider typical or representative of the entire data set. This section
presents methods for describing the center of a data set; in Section 1.4 we will turn to methods for
measuring variability in a set of numbers.


The Mean 


For a given set of numbers \(x_1, x_2, \ldots, x_n\), the most familiar and useful measure of the center is the
mean, or arithmetic average, of the set. Because we will almost always think of the \(x_i\)’s as constituting
a sample, we will often refer to the arithmetic average as the sample mean and denote it by \(\bar{x}\).


DEFINITION The sample mean \(\bar{x}\) of observations \(x_1, x_2, \ldots, x_n\) is given by

\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

The numerator of \(\bar{x}\) can be written more informally as \(\sum x_i\), where the
summation is over all sample observations.


For reporting \(\bar{x}\), we recommend using decimal accuracy of one digit more than the accuracy of the
\(x_i\)’s. Thus if observations are stopping distances with \(x_1 = 125\), \(x_2 = 131\), and so on, we might have
\(\bar{x} = 127.3\) ft.


Example 1.11 Students in a class were assigned to make wingspan measurements at home. The 
wingspan is the horizontal measurement from fingertip to fingertip with outstretched arms. Here are 
the measurements (inches) given by 21 of the students. 


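\(x_1 = 60\)   \(x_2 = 64\)   \(x_3 = 72\)   \(x_4 = 63\)   \(x_5 = 66\)   \(x_6 = 62\)   \(x_7 = 75\)
\(x_8 = 66\)   \(x_9 = 59\)   \(x_{10} = 75\)   \(x_{11} = 69\)   \(x_{12} = 62\)   \(x_{13} = 63\)   \(x_{14} = 61\)
\(x_{15} = 65\)   \(x_{16} = 67\)   \(x_{17} = 65\)   \(x_{18} = 69\)   \(x_{19} = 95\)   \(x_{20} = 60\)   \(x_{21} = 70\)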


Figure 1.12 shows a stem-and-leaf display of the data; a wingspan in the 60s appears to be 
“typical.” 
Figure 1.12 A stem-and-leaf display of the wingspan data


With \(\sum x_i = 1408\), the sample mean is

\[
\bar{x} = \frac{1408}{21} = 67.0 \text{ in},
\]

a value consistent with information conveyed by the stem-and-leaf display. ■
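
The arithmetic in Example 1.11 is easy to check by machine. The following Python sketch is offered only as an illustration; the language and the names `wingspans` and `sample_mean` are our own choices, not something the example depends on. It sums the 21 measurements and divides by n.

```python
# Wingspan measurements (inches) from Example 1.11, in the order x_1, ..., x_21.
wingspans = [60, 64, 72, 63, 66, 62, 75,
             66, 59, 75, 69, 62, 63, 61,
             65, 67, 65, 69, 95, 60, 70]


def sample_mean(xs):
    """Return the sample mean: the sum of the observations divided by n."""
    if not xs:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(xs) / len(xs)


total = sum(wingspans)          # numerator of the sample mean
x_bar = sample_mean(wingspans)

# Report one more decimal digit than the data carry, as recommended above.
print(f"n = {len(wingspans)}, sum = {total}, mean = {x_bar:.1f} in")
# prints: n = 21, sum = 1408, mean = 67.0 in
```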


A physical interpretation of \(\bar{x}\) demonstrates how it measures the center of a sample. Think of a
dotplot in which each dot (i.e., each observation) “weighs” 1 lb. The only point at which a fulcrum
can be placed to balance the system of weights is the point corresponding to the value of \(\bar{x}\) (see
Figure 1.13). The system balances because, as shown in the next section, \(\sum (x_i - \bar{x}) = 0\), so the net
total tendency to turn about \(\bar{x}\) is 0.


[Dotplot of the wingspan data on an axis running from 60 to 95, with the balance point marked at Mean = 67.0]


Figure 1.13 The mean as the balance point for a system of weights
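
The balance-point property can be verified numerically for the wingspan data. The sketch below is an illustration under our own naming (`deviations`, `left_pull`, `right_pull` are hypothetical labels); it uses exact rational arithmetic from Python's `fractions` module so that rounding error cannot blur the result. The deviations from the mean sum to exactly zero, which is the same as saying that the total "pull" of the observations below the mean equals the total pull of those above it.

```python
from fractions import Fraction

# Wingspan measurements (inches) from Example 1.11.
wingspans = [60, 64, 72, 63, 66, 62, 75,
             66, 59, 75, 69, 62, 63, 61,
             65, 67, 65, 69, 95, 60, 70]

# Exact sample mean, 1408/21, kept as a fraction rather than a rounded decimal.
x_bar = Fraction(sum(wingspans), len(wingspans))

# Signed distance of each observation from the fulcrum placed at the mean.
deviations = [x - x_bar for x in wingspans]

net_turning_tendency = sum(deviations)              # sum of (x_i - mean)
left_pull = -sum(d for d in deviations if d < 0)    # total distance below the mean
right_pull = sum(d for d in deviations if d > 0)    # total distance above the mean

print(net_turning_tendency, left_pull == right_pull)
# prints: 0 True
```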


Just as \(\bar{x}\) represents the average value of the observations in a sample, the average of all values in a
population can, in principle, be calculated. This average is called the population mean and will be
denoted by the Greek letter \(\mu\). When there are N values in the population (a finite population), then
\(\mu\) = (sum of the N population values)/N. In Chapters 3 and 4, we will give a more general definition
for \(\mu\) that applies to both finite and (conceptually) infinite populations. In the chapters on statistical
inference, we will present methods based on the sample mean for drawing conclusions about a
population mean. For example, we might use the sample mean \(\bar{x} = 67.0\) computed in Example 1.11
as a point estimate (a single number that is our “best” guess) of \(\mu\) = the true average wingspan for all
students in introductory statistics classes.

The mean suffers from one deficiency that makes it an inappropriate measure of center in some
circumstances: its value can be greatly affected by the presence of even a single outlier (i.e., an
unusually large or small observation). In Example 1.11, the value \(x_{19} = 95\) is obviously an outlier.
Without this observation, \(\bar{x} = 1313/20 = 65.7\) in; the outlier increases the mean by 1.3 in. The value
95 is clearly an error: this student was only 70 inches tall, and there is no way such a student could
have a wingspan of almost 8 ft. As Leonardo da Vinci noticed, wingspan is usually quite close to
height. (Note, though, that outliers are often not the result of recording errors!)

We will next consider an alternative to the mean, namely the median, that is insensitive to outliers. 
However, the mean is still by far the most widely used measure of center, largely because there are 
many populations for which outliers are very scarce. When sampling from such a population (a 
“normal” or bell-shaped distribution being the most important example), outliers are highly unlikely 
to enter the sample. The sample mean will then tend to be stable and quite representative of the 
sample. 


The Median 

The word median is synonymous with “middle,” and the sample median is indeed the middle value
when the observations are ordered from smallest to largest. When the observations are denoted by
\(x_1, \ldots, x_n\), we will use the symbol \(\tilde{x}\) to represent the sample median.


DEFINITION The sample median is obtained by first ordering the n observations from
smallest to largest (with any repeated values included so that every sample
observation appears in the ordered list). Then, if n is odd,

\[
\tilde{x} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ ordered value}
\]

whereas if n is even,

\[
\tilde{x} = \text{average of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2} + 1\right)^{\text{th}} \text{ ordered values}
\]
Example 1.12 People not familiar with classical music might tend to believe that a composer’s
instructions for playing a particular piece are so specific that the duration would not depend at all on 
the performer(s). However, there is typically plenty of room for interpretation, and orchestral con- 
ductors and musicians take full advantage of this. We went to the website ArkivMusic.com and 
selected a sample of 12 recordings of Beethoven’s Symphony No. 9 (the “Choral,” a stunningly 
beautiful work), yielding the following durations (min) listed in increasing order: 


62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0 


Since n = 12 is even, the sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values
from the ordered list:

\[
\tilde{x} = \frac{66.4 + 67.4}{2} = 66.90 \text{ min}
\]


Note that half of the durations in the sample are less than 66.90 min, and half are greater than that.
If the largest observation 79.0 had not been included in the sample, the resulting sample median for
the n = 11 remaining observations would have been the single middle value 66.4 (the [n + 1]/2 = 6th
ordered value, i.e., the 6th value in from either end of the ordered list).
The sample mean is \(\bar{x} = \sum x_i / n = 816.1/12 = 68.01\) min, a bit more than a full minute larger
than the median. The mean is pulled out a bit relative to the median because the sample “stretches
out” somewhat more on the upper end than on the lower end. ■
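
The two cases in the definition of the sample median translate directly into code. The function below is a minimal sketch in Python (the name `sample_median` is our own), applied to the durations of Example 1.12 both with and without the largest observation.

```python
def sample_median(xs):
    """Return the sample median of a non-empty list of numbers."""
    if not xs:
        raise ValueError("the sample median is undefined for an empty sample")
    ordered = sorted(xs)      # repeated values are kept in the ordered list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1)/2)th ordered value (index mid when counting from 0)
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th ordered values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Durations (min) of the 12 recordings in Example 1.12.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

med_all = sample_median(durations)                     # n = 12, even
med_without_largest = sample_median(durations[:-1])    # n = 11, odd

print(f"{med_all:.2f} {med_without_largest:.1f}")
# prints: 66.90 66.4
```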


The data in Example 1.12 illustrates an important property of \(\tilde{x}\) in contrast to \(\bar{x}\). The sample median
is very insensitive to a number of extremely small or extremely large data values. If, for example, we
increased the two largest \(x_i\)’s from 75.7 and 79.0 to 95.7 and 99.0, respectively, \(\tilde{x}\) would be unaffected.
Thus, in the treatment of outlying data values, \(\bar{x}\) and \(\tilde{x}\) are at opposite ends of a spectrum: \(\bar{x}\) is
sensitive to even one such value, whereas \(\tilde{x}\) is insensitive to a large number of outlying values.
Although \(\bar{x}\) and \(\tilde{x}\) both provide a measure for the center of a data set, they will not in general be equal
because they focus on different aspects of the sample.
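
The contrast between the two measures can be seen by carrying out the change just described. The sketch below uses the `mean` and `median` functions from Python's standard `statistics` module; the variable names are our own. Raising the two largest durations by 20 min each leaves the median where it was but moves the mean by 40/12, or about 3.3 min.

```python
import statistics

# Durations (min) from Example 1.12.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

# Replace the two largest values, 75.7 and 79.0, by 95.7 and 99.0.
altered = durations[:-2] + [95.7, 99.0]

median_before = statistics.median(durations)
median_after = statistics.median(altered)
mean_before = statistics.mean(durations)
mean_after = statistics.mean(altered)

print(f"median: {median_before:.2f} -> {median_after:.2f}")
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
```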

Analogous to \(\tilde{x}\) as the middle value in the sample is a middle value in the population, the
population median, denoted by \(\tilde{\mu}\). As with \(\bar{x}\) and \(\mu\), we can think of using the sample median \(\tilde{x}\) to
make an inference about \(\tilde{\mu}\). In Example 1.12, we might use \(\tilde{x} = 66.90\) min as an estimate of the
median duration in the entire population from which the sample was selected. Or, if the median salary
for a sample of statisticians was \(\tilde{x}\) = $96,416, we might use this as a basis for concluding that the
median salary \(\tilde{\mu}\) for all statisticians exceeds $90,000.

The population mean \(\mu\) and median \(\tilde{\mu}\) will not generally be identical. If the population distribution
is positively or negatively skewed, as shown in Figure 1.14, then \(\mu \neq \tilde{\mu}\). When this is the case,
in making inferences we must first decide which of the two population characteristics is of greater
interest and then proceed accordingly. As an example, according to the report “How America Saves
2019” issued by the Vanguard Funds investment company, the mean retirement fund balance among
workers 65 and older is $192,877, whereas the median balance is just $58,035. Clearly a small
minority of such people has extremely large retirement fund balances, inflating the mean relative to
the median; the latter is arguably a better representation of a “typical” retirement fund balance.
Figure 1.14 Three different shapes for a population distribution


Trimmed Means 

The sample mean and sample median are influenced by outlying values to very different degrees: the
mean greatly and the median not at all. Since extreme behavior of either type might be undesirable,
we briefly consider alternative measures that are neither as sensitive as \(\bar{x}\) nor as insensitive as \(\tilde{x}\). To
motivate these alternatives, note that \(\bar{x}\) and \(\tilde{x}\) are at opposite extremes of the same “family” of
measures: while \(\tilde{x}\) is computed by throwing away as many values on each end as one can without
eliminating everything (leaving just one or two middle values), to compute \(\bar{x}\) one throws away
nothing before averaging. Said differently, the mean involves “trimming” 0% from each end of the
sample, whereas for the median the maximum possible amount is trimmed from each end. A trimmed
mean is a compromise between \(\bar{x}\) and \(\tilde{x}\). A 10% trimmed mean, for example, would be computed by
eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.


Example 1.13 Consider the following 20 observations, ordered from smallest to largest, each one 
representing the lifetime (in hours) of a type of incandescent lamp: 


612 623 666 744 883 898 964 970 983 1003 
1016 1022 1029 1058 1085 1088 1122 1135 1197 1201 


The mean and median of all 20 observations are \(\bar{x} = 965.0\) h and \(\tilde{x} = 1009.5\) h, respectively. The
10% trimmed mean is obtained by deleting the smallest two observations (612 and 623) and the
largest two (1197 and 1201) and then averaging the remaining 16 to obtain \(\bar{x}_{tr(10)} = 979.1\) h. The
effect of trimming here is to produce a “central value” that is somewhat above the mean (\(\bar{x}\) is pulled
down by a few small lifetimes) and yet considerably below the median. Similarly, the 20% trimmed
mean averages the middle 12 values to obtain \(\bar{x}_{tr(20)} = 999.9\) h, even closer to the median. See
Figure 1.15.
Figure 1.15 Dotplot of lifetimes (in hours) of incandescent lamps ■
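
The calculations of Example 1.13 can be reproduced with a short function. The sketch below is one possible implementation, not a standard one: the name `trimmed_mean` and the choice to pass the trimming percentage as a whole number are our own assumptions, and the function deliberately handles only the case in which the number of observations to trim from each end is an integer. It also illustrates the "family" idea from the text: 0% trimming gives the mean, and the largest possible trimming (45% when n = 20) leaves the two middle values and so gives the median.

```python
def trimmed_mean(xs, trim_percent):
    """Average what remains after dropping trim_percent% of the sample from each end.

    Assumption of this sketch: trim_percent is a whole number and
    n * trim_percent / 100 is an integer; other cases raise an error.
    """
    ordered = sorted(xs)
    n = len(ordered)
    k, remainder = divmod(n * trim_percent, 100)   # k = observations cut per end
    if remainder != 0:
        raise ValueError("n * trim_percent / 100 must be a whole number")
    if 2 * k >= n:
        raise ValueError("trimming would eliminate every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Lamp lifetimes (hours) from Example 1.13.
lifetimes = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
             1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]

mean_all = trimmed_mean(lifetimes, 0)     # 0% trimming: the ordinary mean
tm10 = trimmed_mean(lifetimes, 10)        # drops 612, 623 and 1197, 1201
tm20 = trimmed_mean(lifetimes, 20)        # averages the middle 12 values
median_all = trimmed_mean(lifetimes, 45)  # leaves only 1003 and 1016

print(f"{tm10:.1f} {tm20:.1f} {median_all:.1f}")
# prints: 979.1 999.9 1009.5
```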


Generally speaking, using a trimmed mean with a moderate trimming proportion (between 5% and
25%) will yield a measure that is neither as sensitive to outliers as the mean nor as insensitive as the
median. For this reason, trimmed means have merited increasing attention from statisticians for both
descriptive and inferential purposes. More will be said about trimmed means when point estimation is
discussed in Chapter 7. As a final point, if the trimming proportion is denoted by \(\alpha\) and \(n\alpha\) is not an
integer, then it is not obvious how the 100\(\alpha\)% trimmed mean should be computed. For example, if
\(\alpha = .10\) (10%) and n = 22, then \(n\alpha = (22)(.10) = 2.2\), and we cannot trim 2.2 observations from each
end of the ordered sample. In this case, the 10% trimmed mean would be obtained by first trimming
two observations from each end and calculating \(\bar{x}_{tr}\), then trimming three and calculating \(\bar{x}_{tr}\), and
finally interpolating between the two values to obtain \(\bar{x}_{tr(10)}\).
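
The interpolation step can be sketched as follows. The text does not spell out the interpolation rule, so the code assumes linear interpolation in the number of observations trimmed from each end: if \(n\alpha\) lies a fraction f of the way from k to k + 1, the result is (1 - f) times the mean with k trimmed plus f times the mean with k + 1 trimmed. The function names are hypothetical. Since the text gives no data set with n = 22, the illustration reuses the 12 durations of Example 1.12 with \(\alpha = .10\), so that \(n\alpha = 1.2\); the resulting value is our own calculation and does not appear in the example.

```python
import math


def mean_after_trimming(ordered, k):
    """Mean of an already ordered sample after dropping k values from each end."""
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def interpolated_trimmed_mean(xs, alpha):
    """100*alpha% trimmed mean, interpolating linearly when n*alpha is not whole.

    Assumption of this sketch: the interpolation is linear in the number of
    observations trimmed from each end.
    """
    ordered = sorted(xs)
    n = len(ordered)
    target = n * alpha              # e.g. 12 * 0.10 = 1.2 observations per end
    lower = math.floor(target)      # trim this many first ...
    frac = target - lower           # ... then move this fraction toward lower + 1
    if frac == 0:
        if 2 * lower >= n:
            raise ValueError("trimming would eliminate every observation")
        return mean_after_trimming(ordered, lower)
    if 2 * (lower + 1) >= n:
        raise ValueError("trimming would eliminate every observation")
    low_mean = mean_after_trimming(ordered, lower)
    high_mean = mean_after_trimming(ordered, lower + 1)
    return (1 - frac) * low_mean + frac * high_mean


# Durations (min) from Example 1.12; n = 12, so 10% trimming means 1.2 per end.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

tm_10 = interpolated_trimmed_mean(durations, 0.10)
print(f"{tm_10:.2f}")
# prints: 67.39
```

Here trimming one value from each end gives 67.48 and trimming two gives 67.0375, so the interpolated value is (0.8)(67.48) + (0.2)(67.0375) = 67.3915.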


Exercises: Section 1.3 (40-51) 


40. The website realtor.com listed the following
sale prices (in $1000s) for a sample of
10 homes sold in 2019 in Los Osos, CA
(home town of two of the authors):


525 830 600 180 129 525 350 490 640 475 


a. Calculate and interpret the sample mean 
and median. 

b. Suppose the second observation was 
930 instead of 830. How would that 
affect the mean and median? 

c. The two low outliers in the sample were 
mobile homes. If we excluded those two 
observations, how would that affect the 
mean and median? 

d. Calculate a 20% trimmed mean by first 
trimming the two smallest and two lar- 
gest observations. 

e. Calculate a 15% trimmed mean. 


41. Super Bowl LIII was the lowest scoring
(and, to many, the least exciting) Super 
Bowl of all time. During the game, Los 
Angeles Rams running back Todd Gurley 
had just 10 rushing plays, resulting in the 
following gains in yards: 


5.2 12 3 -1 162 5 0 

a. Determine the value of the mean. 

b. Determine the value of the median. Why 
is it so different from the mean? 

c. Calculate a trimmed mean by deleting 
the smallest and largest observations. 
What is the corresponding trimming 
percentage? How does the value of this 
trimmed mean compare to the mean and 


median? 


42. The minimum injection pressure (psi) for
injection molding specimens of high amylose
corn was determined for eight different
specimens (higher pressure corresponds to
greater processing difficulty), resulting in
the following observations (from “Thermoplastic
Starch Blends with a
Polyethylene-Co-Vinyl Alcohol: Processability
and Physical Properties,” Polymer
Engr. and Sci. 1994: 17-23):


a. Determine the values of the sample 
mean, sample median, and 12.5% trim- 
med mean, and compare these values. 

b. By how much could the smallest sample 
observation, currently 8.0, be increased 
without affecting the value of the sample 
median? 

c. Suppose we want the values of the sample 
mean and median when the observations 
are expressed in kilograms per square 
inch (ksi) rather than psi. Is it necessary to 
re-express each observation in ksi, or can 
the values calculated in part (a) be used 
directly? [Hint: 1 kg = 2.2 lb.] 


43. Here is the average weekday circulation
(paper plus digital subscriptions) for the top
20 newspapers in the country (247wallst.com,
January 24, 2017):


2,237,601 512,118 507,395 424,721 410,587 
384,962 356,768 299,538 291,991 285,129 
276,445 246,963 245,042 243,376 242,567 
232,546 232,372 227,245 215,476 196,286 


a. Which value, the mean or the median, 
do you anticipate will be higher? Why? 

b. Calculate the mean and median for this 
data. 


44. An article in the Amer. J. Enol. and Viti.
(2006: 486-490) includes the following
alcohol content measurements (%) for a
sample of n = 35 port wines.


16.35 17.73 19.62 19.07 19.48 19.45 19.33
18.85 22.75 19.20 19.90 20.00 19.37 21.22
16.20 23.78 20.05 18.68 19.97 19.20 19.50
17.75 23.25 17.85 18.82 17.48 18.00 15.30
19.58 19.08 19.17 19.03 17.15 19.60 22.25


a. Graph the data. Based on the graph,
what is a representative value for the
alcohol content in port wines?

b. Calculate the mean and the median. Are
these values consistent with your
answer in (a)? Why or why not?


45. Compute the sample median, 25% trimmed
mean, 10% trimmed mean, and sample
mean for the microdrill data given in
Exercise 33, and compare these measures.


46. A sample of 26 offshore oil workers took
part in a simulated escape exercise, resulting
in the accompanying data on time
(sec) to complete the escape (“Oxygen
Consumption and Ventilation During
Escape from an Offshore Platform,” Ergonomics
1997: 281-292):


389 356 359 363 375 424 325 394 402
373 373 370 364 366 364 325 339 393
392 369 374 359 356 403 334 397


a. Construct a stem-and-leaf display of the 
data. How does it suggest that the sam- 
ple mean and median will compare? 

b. Calculate the values of the sample mean 
and median. 

c. By how much could the largest time, 
currently 424, be increased without 
affecting the value of the sample med- 
ian? By how much could this value be 
decreased without affecting the value of 
the sample median? 

d. What are the values of \(\bar{x}\) and \(\tilde{x}\) when the
observations are re-expressed in minutes?


47. Blood pressure values are often reported to
the nearest 5 mmHg (100, 105, 110, etc.).
Suppose the actual blood pressure values
for nine randomly selected individuals are


118.6 127.4 138.4 130.0 113.7 122.0 108.3 131.5 133.2


a. What is the median of the reported
blood pressure values?

b. Suppose the blood pressure of the second
individual is 127.6 rather than
127.4 (a small change in a single value).
How does this affect the median of the
reported values? What does this say
about the sensitivity of the median to
rounding or grouping in the data?


48. Let \(x_1, \ldots, x_n\) be a sample, and let a and
b be constants with \(a \neq 0\). Define a new
sample \(y_1, \ldots, y_n\) by \(y_1 = ax_1 + b\), ...,
\(y_n = ax_n + b\).

a. How does the sample mean of the \(y_i\)’s
relate to the mean of the \(x_i\)’s? Verify
your conjectures.

b. How does the sample median of the \(y_i\)’s
relate to the median of the \(x_i\)’s? Substantiate
your assertion.


49. An experiment to study the lifetime (in
hours) for a certain type of component
involved putting ten components into
operation and observing them for 100 h.
Eight of the components failed during
that period, and those lifetimes were
recorded. Denote the lifetimes of the two
components still functioning after 100 h
by 100+. The resulting sample observations
were


48 79 100+ 35 92 86 57 100+ 17 29

Which of the measures of center discussed
in this section can be calculated, and what
are the values of those measures? [Note:
The data from this study is said to be
“right-censored.”]


A sample of n= 10 automobiles was 
selected, and each was subjected to a 5-mph 
crash test. Denoting a car with no visible 
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damage by S (for success) and a car 5]. Refer back to Example 1.10, in which 120 
with such damage by F, results were as motorcycle owners were asked to specify 
follows: their bikes’ manufacturer. 


a. Is the variable manufacturer quantita- 
tive or categorical? 

b. Based on the sample data, what would 
you consider a “typical” or “represen- 
tative” value for the variable, and why? 

c. Suppose the responses were recoded 
according to the numbering indicated in 
Table 1.2. (1 = Honda, 2 = Yamaha, 
etc.), resulting in a data set consisting of 
41 1’s, 27 2’s, and so on. Would it be 
reasonable to use the mean of these 120 
numbers as a representative value? 
What about the median? Explain. 


a. What is the sample proportion of 
successes? 

b. Replace each S with a | and each F with 
a 0. Then calculate x for this numerically 
coded sample. How does x compare to 
the sample proportion of successes? 

c. Suppose it is decided to include 15 more 
cars in the experiment. How many of 
these would have to be S’s to give a 
sample proportion of .80 for the entire 
sample of 25 cars? 


1.4 Measures of Variability 


Reporting a measure of center gives only partial information about a data set or distribution. Different 
samples or populations may have identical measures of center yet differ from one another in other 
important ways. Figure 1.16 shows dotplots of three samples with the same mean and median, yet the 
extent of spread about the center is different for all three samples. The first sample has the largest 
amount of variability, the second has less variability than the first, and the third has the smallest 
amount. 
Figure 1.16 Samples with identical measures of center but different amounts of variability


Measures of Variability for Sample Data 

The simplest measure of variability in a sample is the range, which is the difference between the
largest and smallest sample values. Notice that the value of the range for sample 1 in Figure 1.16 is
much larger than it is for sample 3, reflecting more variability in the first sample than in the third.
A defect of the range, though, is that it depends on only the two most extreme observations and
disregards the positions of the remaining $n - 2$ values. Samples 1 and 2 in Figure 1.16 have identical
ranges, yet when we take into account the observations between the two extremes, there is much less
variability or dispersion in the second sample than in the first.
Our primary measures of variability will involve the n deviations from the mean: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$,
obtained by subtracting $\bar{x}$ from each sample observation. A deviation will be positive if the
observation is larger than the mean (to the right of the mean on the measurement axis) and negative if
the observation is smaller than the mean. If all the deviations are small in magnitude, then all $x_i$’s are
close to the mean and there is little variability. On the other hand, if some of the deviations are large
in magnitude, then some $x_i$’s lie far from $\bar{x}$, suggesting a greater amount of variability.
A simple way to combine the deviations into a single quantity is to average them (sum them and 
divide by n). Unfortunately, this does not yield a useful measure, because the positive and negative 
deviations counteract one another: 


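$$\text{sum of deviations} = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$$


Thus the average deviation is always zero. To see why, use standard rules of summation and the fact
that $\sum \bar{x} = \bar{x} + \bar{x} + \cdots + \bar{x} = n\bar{x}$:


$$\sum (x_i - \bar{x}) = \sum x_i - \sum \bar{x} = \sum x_i - n\bar{x} = \sum x_i - n\left(\frac{1}{n}\sum x_i\right) = 0$$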


Another possibility is to base a measure on the absolute values of the deviations, in particular the
mean absolute deviation $\sum |x_i - \bar{x}|/n$. But because the absolute value operation leads to some
calculus-related difficulties, statisticians instead work with the squared deviations
$(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$. Rather than use the average squared deviation $\sum (x_i - \bar{x})^2/n$, for
several reasons the sum of squared deviations is divided by $n - 1$ rather than n.


DEFINITION Let $S_{xx} = \sum (x_i - \bar{x})^2$, the sum of the squared deviations from the mean.
Then the sample standard deviation, denoted by s, is given by

$$s = \sqrt{\frac{S_{xx}}{n-1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$

The quantity $s^2$ is called the sample variance.


The unit for s is the same as the unit for each of the $x_i$’s. If, for example, the observations are fuel
efficiencies in miles per gallon, then we might have s = 2.0 mpg. A rough interpretation of the sample
standard deviation is that it represents the size of a typical deviation from the sample mean within the
given sample. Thus if s = 2.0 mpg, then some $x_i$’s in the sample are closer than 2.0 to $\bar{x}$, whereas
others are farther away; 2.0 is a representative (or “standard”) deviation from the mean fuel efficiency.
If s = 3.0 for a second sample of cars of another type, a typical deviation in this sample is roughly 1.5
times what it is in the first sample, an indication of more variability in the second sample.


Example 1.14 The website www.fueleconomy.gov contains a wealth of information about fuel 
characteristics of various vehicles. In addition to EPA mileage ratings, there are many vehicles for 
which users have reported their own values of fuel efficiency (mpg). Consider Table 1.3 with n = 10 
efficiencies for the 2015 Toyota Camry (for this model, the EPA reports an overall rating of 25 mpg in 
city driving and 34 mpg in highway driving). 
Table 1.3 Data for Example 1.14

  i     $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
  1     31.0      2.1               4.41
  2     27.8     -1.1               1.21
  3     38.3      9.4              88.36
  4     27.0     -1.9               3.61
  5     23.4     -5.5              30.25
  6     30.0      1.1               1.21
  7     30.1      1.2               1.44
  8     21.5     -7.4              54.76
  9     25.4     -3.5              12.25
 10     34.5      5.6              31.36

$\sum x_i = 289.0$    $\sum (x_i - \bar{x}) = 0.0$    $\sum (x_i - \bar{x})^2 = 228.86$
$\bar{x} = 28.9$


The numerator of $s^2$ is $S_{xx} = 228.86$, from which

$$s = \sqrt{\frac{S_{xx}}{n-1}} = \sqrt{\frac{228.86}{10-1}} = \sqrt{25.43} = 5.04 \text{ mpg}$$

The size of a typical difference between a driver’s fuel efficiency and the mean of 28.9 in this sample
is roughly 5.04 mpg. ■
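The arithmetic of Table 1.3 can be reproduced in a few lines of code. The following Python sketch is an illustration only (the variable names are our own); it computes the deviations, $S_{xx}$, and s directly from the definition, and the standard library function statistics.stdev, which also uses the divisor $n - 1$, serves as a cross-check.

```python
import math
import statistics

# Fuel efficiencies (mpg) from Table 1.3
mpg = [31.0, 27.8, 38.3, 27.0, 23.4, 30.0, 30.1, 21.5, 25.4, 34.5]

n = len(mpg)
xbar = sum(mpg) / n                      # sample mean
deviations = [x - xbar for x in mpg]     # x_i - xbar, one per observation
s_xx = sum(d ** 2 for d in deviations)   # sum of squared deviations
s = math.sqrt(s_xx / (n - 1))            # sample standard deviation

print(round(xbar, 1), round(s_xx, 2), round(s, 2))
print(statistics.stdev(mpg))             # library value, same divisor n - 1
```

Up to floating-point rounding, the deviations sum to zero, as the summation argument earlier in this section requires.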


To explain why $n - 1$ rather than n is used to compute s, note first that whereas s measures
variability in a sample, there is a measure of population variability called the population standard
deviation. We will use $\sigma$ (lowercase Greek letter sigma) to denote the population standard deviation
and $\sigma^2$ to denote the population variance. When the population is finite and consists of N values,

$$\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N$$

which is the average of all squared deviations from the population mean (for the population, the
divisor is N and not $N - 1$). More general definitions of $\sigma^2$ for (conceptually) infinite populations
appear in Chapters 3 and 4.

Just as $\bar{x}$ will be used to make inferences about the population mean $\mu$, we should define the
sample standard deviation s so that it can be used to make inferences about $\sigma$. Note that $\sigma$ involves
squared deviations about the population mean $\mu$. If we actually knew the value of $\mu$, then we could
define the sample variance as the average squared deviation of the sample $x_i$’s about $\mu$, with s its square root.
However, the value of $\mu$ is almost never known, so the sum of squared deviations about $\bar{x}$ must be
used in the definition of s. But the $x_i$’s tend to be closer to their own average $\bar{x}$ than to the population
average $\mu$. Using the divisor $n - 1$ rather than n compensates for this tendency. A more formal
explanation for this choice appears in Chapter 7.
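The compensation can be checked exactly in a small case. The Python sketch below is an illustration under stated assumptions, not the formal argument of Chapter 7: the five-value population is hypothetical, and samples of size n = 3 are drawn with replacement, so that every ordered sample is equally likely. It averages $S_{xx}/n$ and $S_{xx}/(n-1)$ over all possible samples, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite population (values chosen only for illustration)
population = [2, 4, 4, 7, 13]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N  # divisor N

n = 3  # sample size
# Every ordered sample of size n, drawn with replacement (assumption)
samples = list(product(population, repeat=n))

def s_xx(sample):
    """Sum of squared deviations about the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - xbar) ** 2 for x in sample)

avg_div_n = sum(s_xx(smp) / n for smp in samples) / len(samples)
avg_div_n_minus_1 = sum(s_xx(smp) / (n - 1) for smp in samples) / len(samples)

print("population variance:     ", sigma2)
print("average of S_xx / n:     ", avg_div_n)
print("average of S_xx / (n-1): ", avg_div_n_minus_1)
```

In this setting the divisor n gives values that are too small on average, by the factor $(n-1)/n$, while the divisor $n - 1$ matches $\sigma^2$ on average.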

It is customary to refer to s as being based on $n - 1$ degrees of freedom (df). This terminology
results from the fact that although s is based on the n quantities $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$, these sum
to 0, so specifying the values of any $n - 1$ of the quantities determines the remaining value. For
example, if $n = 4$ and $x_1 - \bar{x} = 8$, $x_2 - \bar{x} = -6$, and $x_4 - \bar{x} = -4$, then automatically $x_3 - \bar{x} = 2$, so
only three of the four values of $x_i - \bar{x}$ are “freely determined” (3 df).
A Computing Formula for s

Typically, software or a calculator is used to compute summary quantities such as $\bar{x}$ and s. Otherwise,
computing and squaring the deviations can be tedious, especially if enough decimal accuracy is being
used in $\bar{x}$ to guard against the effects of rounding. An alternative formula for the numerator of $s^2$
circumvents the need for all the subtraction necessary to obtain the deviations.


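PROPOSITION An alternative expression for the numerator of $s^2$ is

$$S_{xx} = \sum x_i^2 - n\bar{x}^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$


Proof Because $\bar{x} = \sum x_i/n$, $n\bar{x}^2 = n(\sum x_i)^2/n^2 = (\sum x_i)^2/n$. Then,

$$\sum (x_i - \bar{x})^2 = \sum (x_i^2 - 2\bar{x}x_i + \bar{x}^2) = \sum x_i^2 - 2\bar{x}\sum x_i + \sum \bar{x}^2$$
$$= \sum x_i^2 - 2\bar{x} \cdot n\bar{x} + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$ ■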


Example 1.15 Traumatic knee dislocation often requires surgery to repair ruptured ligaments. One 
measure of recovery is range of motion, measured as the angle formed when, starting with the leg 
straight, the knee is bent as far as possible. The given data on postsurgical range of motion appeared 
in the article “Reconstruction of the Anterior and Posterior Cruciate Ligaments After Knee Dislo- 
cation” (Amer. J. Sports Med. 1999: 189-197): 


154 142 137 133 122 126 135 135 108 120 127 134 122

The sum of these 13 sample observations is $\sum x_i = 1695$, and the sum of their squares is

$$\sum x_i^2 = 154^2 + 142^2 + \cdots + 122^2 = 222{,}581$$

Thus the numerator of the sample variance is

$$S_{xx} = \sum x_i^2 - \left(\sum x_i\right)^2/n = 222{,}581 - (1695)^2/13 = 1579.0769$$

from which $s^2 = 1579.0769/12 = 131.59$ and $s = \sqrt{131.59} = 11.47$ degrees. ■
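As a check on Example 1.15, the following Python sketch (illustrative only; the variable names are our own) evaluates the numerator $S_{xx}$ both by the computing formula and by the definition, and confirms that the two agree.

```python
import math

# Postsurgical range of motion (degrees) from Example 1.15
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

n = len(rom)
sum_x = sum(rom)                    # sum of the observations
sum_x2 = sum(x * x for x in rom)    # sum of the squared observations

# Computing formula: no deviations are needed
s_xx_shortcut = sum_x2 - sum_x ** 2 / n

# Definition: sum of squared deviations from the mean
xbar = sum_x / n
s_xx_definition = sum((x - xbar) ** 2 for x in rom)

variance = s_xx_shortcut / (n - 1)
s = math.sqrt(variance)
print(sum_x, sum_x2, round(s_xx_shortcut, 4), round(variance, 2), round(s, 2))
```

One caution when the formula is used in floating-point arithmetic: if the observations are large relative to their spread, the two terms being subtracted are nearly equal and rounding error can dominate. Subtracting a convenient constant from every observation first leaves $S_{xx}$ unchanged and reduces that risk.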


If our data is rescaled—for instance, changing Celsius temperature measurements to Fahrenheit— 
the standard deviation of the rescaled data can easily be determined from the standard deviation of the 
original values. 
PROPOSITION Let $x_1, x_2, \ldots, x_n$ be a sample and c be a constant.

1. If $y_1 = x_1 + c, y_2 = x_2 + c, \ldots, y_n = x_n + c$, then $s_y = s_x$; and

2. If $y_1 = cx_1, \ldots, y_n = cx_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$

where $s_x$ is the sample standard deviation of the $x_i$’s and $s_y$ is the
sample standard deviation of the $y_i$’s.


Result 1 is intuitive, because adding or subtracting c shifts the location of the data set but leaves
distances between data values unchanged. According to Result 2, multiplication of each $x_i$ by c results
in s being multiplied by a factor of $|c|$. Verification of these results utilizes the properties $\bar{y} = \bar{x} + c$
and $\bar{y} = c\bar{x}$ (see Exercise 72).
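The two results can also be checked numerically. In the Python sketch below, the Celsius readings are hypothetical values chosen only for illustration; the code shifts them, scales them by a negative constant, and converts them to Fahrenheit (a scaling by 1.8 followed by a shift of 32), printing the standard deviation in each case.

```python
import statistics

# Hypothetical Celsius readings, used only to illustrate the proposition
celsius = [18.5, 21.0, 19.4, 25.1, 22.3, 20.7]

s_x = statistics.stdev(celsius)

shifted = [x + 10 for x in celsius]           # y_i = x_i + c with c = 10
scaled = [-1.8 * x for x in celsius]          # y_i = c * x_i with c = -1.8
fahrenheit = [1.8 * x + 32 for x in celsius]  # scaling, then a shift

print("s_x:                   ", s_x)
print("after adding 10:       ", statistics.stdev(shifted))
print("after scaling by -1.8: ", statistics.stdev(scaled))
print("in Fahrenheit:         ", statistics.stdev(fahrenheit))
print("1.8 * s_x:             ", 1.8 * s_x)
```

The shifted sample has the same standard deviation as the original, while the scaled and Fahrenheit samples both have standard deviation $1.8 s_x$; the sign of c does not matter, only $|c|$.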


Quartiles and the Interquartile Range 

In Section 1.3, we discussed the sensitivity of the sample mean $\bar{x}$ to outliers. Since the standard
deviation is based on measurements from the mean, s is also heavily influenced by outliers. (In fact,
the effect of outliers on s can be especially severe, since each deviation is squared during computation.)
It is therefore desirable to create a measure of variability that is “resistant” to the presence of a
few outliers, analogous to the median.


DEFINITION Order the n observations from smallest to largest, and separate the lower half from
the upper half; the median is included in both halves if n is odd. The lower quartile
(or first quartile), $q_1$, is the median of the lower half of the data, and the upper
quartile (or third quartile), $q_3$, is the median of the upper half.¹
A measure of spread that is resistant to outliers is the interquartile range (iqr),
given by

iqr = $q_3 - q_1$


The term quartile comes from the fact that the lower quartile divides the smallest quarter of obser- 
vations from the remainder of the data set, while the upper quartile separates the top quarter of values 
from the rest. The interquartile range is unaffected by observations in the smallest 25% or the largest 
25% of the data—hence, it is robust against (resistant to) outliers. Roughly speaking, we can interpret 
the iqr as the range of the “middle 50%” of the observations. 


Example 1.16 Consider the ordered fuel efficiency data from Example 1.14:

21.5 23.4 25.4 27.0 27.8 | 30.0 30.1 31.0 34.5 38.3

The vertical line separates the two halves of the data; the median efficiency is $\tilde{x} = (27.8 + 30.0)/2 = 28.9$
mpg, coincidentally exactly the same as the mean. The quartiles are the middle values of the two halves;
from the displayed data, we see that

$q_1 = 25.4$    $q_3 = 31.0$    iqr = 31.0 − 25.4 = 5.6 mpg

The software package R reports the lower and upper quartiles to be 25.8 and 30.775, respectively,
while JMP and Minitab both give 24.9 and 31.875.

Imagine that the lowest value had been 10.5 instead of 21.5 (indicating something very wrong with
that particular Camry!). Then the sample standard deviation would explode from 5.04 mpg (see
Example 1.14) to 7.46 mpg, a nearly 50% increase. Meanwhile, the quartiles and the iqr would not
change at all; those quantities would be unaffected by this low outlier. ■


¹ Different software packages calculate the quartiles (and, thus, the iqr) somewhat differently, for example using different
interpolation methods between x values. For smaller data sets, the difference can be noticeable; this is typically less of
an issue for larger data sets.
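The quartile definition of this section, and its disagreement with interpolating software, can be illustrated with a short Python sketch. The function name quartiles_by_halves is our own. The two method options of the standard library function statistics.quantiles are shown only as examples of interpolation rules; for this particular data set they happen to return the same two pairs of values quoted in Example 1.16, which is not a claim about how any package is implemented.

```python
import statistics

def quartiles_by_halves(data):
    """Lower and upper quartiles as medians of the lower and upper halves.

    Follows the definition in this section: when n is odd, the median
    belongs to both halves.
    """
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2                      # size of each half
    lower, upper = x[:half], x[n - half:]
    return statistics.median(lower), statistics.median(upper)

mpg = [31.0, 27.8, 38.3, 27.0, 23.4, 30.0, 30.1, 21.5, 25.4, 34.5]

q1, q3 = quartiles_by_halves(mpg)
print("halves rule:", q1, q3, "iqr =", round(q3 - q1, 2))

# Two interpolation rules from the standard library, for comparison
print("inclusive:  ", statistics.quantiles(mpg, n=4, method="inclusive"))
print("exclusive:  ", statistics.quantiles(mpg, n=4, method="exclusive"))

# Replace the smallest value by 10.5: s changes, the quartiles do not
altered = [10.5 if x == 21.5 else x for x in mpg]
print("s before and after:", round(statistics.stdev(mpg), 2),
      round(statistics.stdev(altered), 2))
print("quartiles after:   ", quartiles_by_halves(altered))
```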


The quartiles and interquartile range lead to a popular statistical convention for defining outliers 
(i.e., unusual observations) first proposed by renowned statistician John Tukey. 


DEFINITION Any observation farther than 1.5iqr from the closest quartile is an outlier.
An outlier is extreme if it is more than 3iqr from the nearest quartile,
and it is mild otherwise.


That is, outliers are defined to be all x values in the sample that satisfy either

$$x < q_1 - 1.5\,\text{iqr} \quad \text{or} \quad x > q_3 + 1.5\,\text{iqr}$$


Boxplots 

In Section 1.2, several graphical displays (stem-and-leaf, dotplot, histogram) were introduced as tools 
for visualizing quantitative data. We now introduce one more graph, the boxplot, which relies on the 
quartiles, iqr, and aforementioned outlier rule. A boxplot shows several of a data set’s most prominent 
features, including center, spread, the extent and nature of any departure from symmetry, and outliers. 


Constructing a Boxplot 

1. Draw a measurement scale (horizontal or vertical). 

2. Draw a rectangle adjacent to this axis beginning at $q_1$ and ending at $q_3$ (so rectangle length = iqr).

3. Place a line segment at the location of the median. (The position of the median symbol relative to 
the two edges conveys information about the skewness of the middle 50% of the data.) 

4. Determine which data values, if any, are outliers. Mark each outlier individually. (We may use 
different symbols for mild and extreme outliers; most statistical software packages do not make a 
distinction.) 

5. Finally, draw “whiskers” out from either end of the rectangle to the smallest and largest 
observations that are not outliers. 


Example 1.17 The Clean Water Act and subsequent amendments require that all waters in the USA 
meet specific pollution reduction goals to ensure that water is “fishable and swimmable.” The article 
“Spurious Correlation in the USEPA Rating Curve Method for Estimating Pollutant Loads” 
(J. Environ. Engr. 2008: 610-618) investigated various techniques for estimating pollutant loads in 
watersheds; the authors discuss “the imperative need to use sound statistical methods” for this 
purpose. Among the data considered is the following sample of total nitrogen loads (TN, in kg of 
nitrogen/day) from a particular Chesapeake Bay location, displayed here in increasing order. 


9.69 13.16 17.09 18.12 23.70 24.07 24.29 26.43
30.75 31.54 35.07 36.99 40.32 42.51 45.64 48.22
49.98 50.06 55.02 57.00 58.41 61.31 64.25 65.24
66.14 67.68 81.40 90.80 92.17 92.42 100.82 101.94
103.61 106.28 106.80 108.69 114.61 120.86 124.54 143.27
143.75 149.64 167.79 182.50 192.55 193.53 271.57 292.61
312.45 352.09 371.47 444.68 460.86 563.92 690.11 826.54
1529.35
Relevant summary quantities are

$\tilde{x} = 92.17$    $q_1 = 45.64$    $q_3 = 167.79$
iqr = 122.15    1.5iqr = 183.225    3iqr = 366.45

Again, software packages may report slightly different values. Subtracting 1.5iqr from the lower
quartile gives a negative number, and none of the observations are negative, so there are no outliers
on the lower end of the data. However,

$q_3 + 1.5\,\text{iqr} = 351.015$ and $q_3 + 3\,\text{iqr} = 534.24$


Thus the four largest observations—563.92, 690.11, 826.54, and 1529.35—are extreme outliers, and 
352.09, 371.47, 444.68, and 460.86 are mild outliers. 

The whiskers in the boxplot in Figure 1.17 extend out to the smallest observation 9.69 on the low 
end and 312.45, the largest observation that is not an outlier, on the upper end. There is some positive 
skewness in the middle half of the data (the right edge of the box is somewhat further from the median 
line than is the left edge) and a great deal of positive skewness overall. 


Figure 1.17 A boxplot of the nitrogen load data showing mild and extreme outliers (horizontal axis: daily nitrogen load, 0 to 1600) ■
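The quantities needed to draw this boxplot can be computed directly from the boxplot construction steps and the outlier rule. The Python sketch below is illustrative only: the function and variable names are our own, and the quartiles follow the median-of-halves definition of this section, so software that interpolates may give slightly different fences.

```python
import statistics

# Total nitrogen loads (kg N/day) from Example 1.17, in increasing order
tn = [9.69, 13.16, 17.09, 18.12, 23.70, 24.07, 24.29, 26.43,
      30.75, 31.54, 35.07, 36.99, 40.32, 42.51, 45.64, 48.22,
      49.98, 50.06, 55.02, 57.00, 58.41, 61.31, 64.25, 65.24,
      66.14, 67.68, 81.40, 90.80, 92.17, 92.42, 100.82, 101.94,
      103.61, 106.28, 106.80, 108.69, 114.61, 120.86, 124.54, 143.27,
      143.75, 149.64, 167.79, 182.50, 192.55, 193.53, 271.57, 292.61,
      312.45, 352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54,
      1529.35]

def quartiles_by_halves(data):
    """Quartiles as medians of the two halves (median shared if n is odd)."""
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2
    return statistics.median(x[:half]), statistics.median(x[n - half:])

median = statistics.median(tn)
q1, q3 = quartiles_by_halves(tn)
iqr = q3 - q1

# Fences for mild (1.5 iqr) and extreme (3 iqr) outliers
lo_mild, hi_mild = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lo_ext, hi_ext = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in tn if x < lo_mild or x > hi_mild]
extreme = [x for x in outliers if x < lo_ext or x > hi_ext]
mild = [x for x in outliers if x not in extreme]

# Whiskers end at the most extreme observations that are not outliers
inside = [x for x in tn if lo_mild <= x <= hi_mild]
whisker_low, whisker_high = min(inside), max(inside)

print("median, q1, q3:", median, q1, q3)
print("mild outliers:   ", mild)
print("extreme outliers:", extreme)
print("whiskers end at: ", whisker_low, whisker_high)
```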


Placing individual boxplots side by side can reveal similarities and differences between two or 
more data sets consisting of observations on the same variable. 


Example 1.18 Chronic kidney disease (CKD) affects vital systems throughout the body, including 
the production of fibrinogen, a protein that helps in the formation of blood clots. (Both too much and 
too little fibrinogen are dangerous.) The article “Comparison of Platelet Function and Viscoelastic 
Test Results between Healthy Dogs and Dogs with Naturally Occurring [CKD]” (Amer. J. Veterinary 
Res. 2017: 589-600) compared the fibrinogen levels (mg/dl of blood) in 11 dogs with CKD to 10 dogs 
with normal kidney function. Figure 1.18 presents a stem-and-leaf display of the data (some values 
were estimated from a graph in the article). 
Healthy dogs         Dogs with CKD
           9 | 0 |
      987766 | 1 | 89
          50 | 2 | 579
           1 | 3 | 123
             | 4 | 1
             | 5 | 0
             | 6 |            Stem: Hundreds digit
             | 7 |            Leaf: Tens digit
             | 8 | 2


Figure 1.18 Stem-and-leaf display for Example 1.18 


Numerical summary quantities are as follows: 


           $\bar{x}$    $\tilde{x}$    s        iqr
Healthy    190.7        179.5          57.0     36.0
CKD        353.1        315.0          179.7    107.5


The values of the mean and median suggest that fibrinogen levels are much higher in dogs with 
CKD than in healthy dogs. Moreover, the variability in fibrinogen levels is much greater in the 
unhealthy dogs: the interquartile range for dogs with CKD (107.5 mg/dl) is nearly triple the value for 
healthy dogs. Figure 1.19 shows side-by-side boxplots from the JMP software package. There is 
obviously a systematic tendency for fibrinogen levels to be higher in the CKD group than the healthy 
group, and there is much more variability in the former group than in the latter one. Aside from the 
single outlier in each group, there is a reasonable amount of symmetry in both distributions. 


Figure 1.19 Comparative boxplots of the data in Example 1.18, from JMP (axis: fibrinogen level; dog groups: CKD and Healthy)


The authors of the article conclude that chronic kidney disease in dogs can lead to “hypercoagulability”
(i.e., overclotting), which presents very serious health risks. ■
Exercises: Section 1.4 (52-72)


52. Here is the data on fibrinogen levels (mg/dl) for 10 healthy dogs and 11 dogs with chronic kidney disease discussed in Example 1.18:

Healthy: 99 160 165 170 178 181 190 201 250 313

CKD: 183 190 250 275 290 315 320 330 410 500 821

a. For the data on the 10 healthy dogs, calculate the range, variance, standard deviation, quartiles, and interquartile range.

b. Repeat part (a) for the 11 dogs with CKD.


53. The article “Oxygen Consumption During Fire Suppression: Error of Heart Rate Estimation” (Ergonomics 1991: 1469-1474) reported the following data on oxygen consumption (mL/kg/min) for a sample of ten firefighters performing a fire-suppression simulation:

29.5 49.3 30.6 28.2 28.0 26.3 33.9 29.4 23.5 31.6

Compute the following:

a. The sample range

b. The sample variance $s^2$ from the definition (by first computing deviations, then squaring them, etc.)

c. The sample standard deviation

d. $s^2$ using the shortcut method


54. The value of Young’s modulus (GPa) was determined for cast plates consisting of certain intermetallic substrates, resulting in the following sample observations (“Strength and Modulus of a Molybdenum-Coated Ti-25Al-10Nb-3U-1Mo Intermetallic,” J. Mater. Engr. Perform. 1997: 46-50):

116.4 115.9 114.6 115.2 115.8

a. Calculate $\bar{x}$ and the deviations from the mean.

b. Use the deviations calculated in part (a) to obtain the sample variance and the sample standard deviation.

c. Calculate $s^2$ by using the computational formula for the numerator $S_{xx}$.

d. Subtract 100 from each observation to obtain a sample of transformed values. Now calculate the sample variance of these transformed values, and compare it to $s^2$ for the original data. State the general principle.


55. The accompanying observations on stabilized viscosity (cP) for specimens of a certain grade of asphalt with 18% rubber added are from the article “Viscosity Characteristics of Rubber-Modified Asphalts” (J. Mater. Civil Engr. 1996: 153-156):

2781 2900 3013 2856 2888

a. What are the values of the sample mean and sample median?

b. Calculate the sample variance using the computational formula. [Hint: First subtract a convenient number from each observation.]


56. Calculate and interpret the values of the sample median, sample mean, and sample standard deviation for the following observations on fracture strength (MPa, read from a graph in “Heat-Resistant Active Brazing of Silicon Nitride: Mechanical Evaluation of Braze Joints,” Welding J., Aug. 1997):

87 93 96 98 105 114 128 131 142 168


57. Exercise 46 in Section 1.3 presented a sample of 26 escape times for oil workers in a simulated exercise. Calculate and interpret the sample standard deviation. [Hint: $\sum x_i = 9638$ and $\sum x_i^2 = 3{,}587{,}566$.]


58. Acrylamide is a potential carcinogen that forms in certain foods, such as potato chips and French fries. The FDA analyzed McDonald’s French fries purchased at seven different locations; the following are the resulting acrylamide levels (in micrograms per kg of food):

497 193 328 155 326 245 270

Calculate $\sum x_i$ and $\sum x_i^2$ and then $s^2$ and s.


59. In 1997 a woman sued a computer keyboard manufacturer, charging that her repetitive stress injuries were caused by the keyboard (Genessy v. Digital Equipment Corp.). The jury awarded about $3.5 million for pain and suffering, but the court then set aside that award as being unreasonable compensation. In making this determination, the court identified a “normative” group of 27 similar cases and specified a reasonable award as one within two standard deviations of the mean of the awards in the 27 cases. The 27 awards (in $1000s) were

37 60 75 115 135 140 149 150 238
290 340 410 600 750 750 750 1050 1100
1139 1150 1200 1200 1250 1576 1700 1825 2000

from which $\sum x_i = 20{,}179$, $\sum x_i^2 = 24{,}657{,}511$. What is the maximum possible amount that could be awarded under the two standard deviation rule?


60. The US Women’s Swimming Team won the 1500 m relay at the 2016 Olympic Games. Here are the completion times, in seconds, for all eight teams that competed in the finals:

233.13 235.00 235.01 235.18
235.49 235.66 236.96 239.50

a. Calculate the sample variance and standard deviation.

b. If the observations were re-expressed in minutes, what would be the resulting values of the sample variance and sample standard deviation? Answer without actually performing the reexpression.


61. The first four deviations from the mean in a sample of n = 5 reaction times were .3, .9, 1.0, and 1.3. What is the fifth deviation from the mean? Give a sample for which these are the five deviations from the mean.


62. Reconsider the data on recent home sales (in $1000s) provided in Exercise 40:

525 830 600 180 129
525 350 490 640 475

a. Determine the upper and lower quartiles, and then the iqr.

b. If the two largest sample values, 830 and 640, had instead been 930 and 740, how would this affect the iqr? Explain.

c. By how much could the observation 129 be increased without affecting the iqr? Explain.

d. If an 11th observation, $x_{11} = 845$, is added to the sample, what now is the iqr?


63. Reconsider the court awards data presented in Exercise 59.

a. What are the values of the quartiles, and what is the value of the iqr?

b. How large or small does an observation have to be to qualify as an outlier? As an extreme outlier?

c. Construct a boxplot, and comment on its features.


64. Here is a stem-and-leaf display of the escape time data introduced in Exercise 46.

32 | 55
33 | 49
34 |
35 | 6699
36 | 34469
37 | 03345
38 | 9
39 | 2347
40 | 23
41 |
42 | 4

a. Determine the value of the interquartile range.

b. Are there any outliers in the sample? Any extreme outliers?

c. Construct a boxplot and comment on its features.

d. By how much could the largest observation, currently 424, be decreased without affecting the value of the iqr?
65.


66. 


67. 


Many people who believe they may be 
suffering from the flu visit emergency 
rooms, where they are subjected to long 
waits and may expose others or themselves 
be exposed to various diseases. The article 
“Drive-Through Medicine: A Novel Pro- 
posal for the Rapid Evaluation of Patients 
During an Influenza Pandemic” (Annals 
Emerg. Med. 2010: 268-273) described an 
experiment to see whether patients could be 
evaluated while remaining in their vehicles. 
The following total processing times 
(min) for 38 individuals were read from a 
graph that appeared in the cited article: 


16 16 17 19 20 20 20 
23 23 23 24 24 24 24 
25 26 26 27 27 28 28 
29 29 30 32 33 33 34 
43 44 46 48 53 


a. Calculate several different measures of 
center and compare them. 

b. Are there any outliers in this sample? 
Any extreme outliers? 

c. Construct a boxplot and comment on 
any interesting features. 


Here is summary information on the alco- 
hol percentage for a sample of 25 beers: 


$q_1 = 4.35$    $\tilde{x} = 5$    $q_3 = 5.95$


The bottom three are 3.20 (Heineken Pre- 
mium Light), 3.50 (Amstel light), 4.03 
(Shiner Light) and the top three are 7.50 
(Terrapin All-American Imperial Pilsner), 
9.10 (Great Divide Hercules Double IPA), 
11.60 (Rogue Imperial Stout). 


a. Are there any outliers in the sample? 
Any extreme outliers? 

b. Construct a boxplot, and comment on 
any interesting features. 


A company utilizes two different machines 
to manufacture parts of a certain type. 
During a single shift, a sample of n = 20 
parts produced by each machine is 
obtained, and the value of a particular
critical dimension for each part is deter- 
mined. The comparative boxplot below is 
constructed from the resulting data. Com- 
pare and contrast the two samples. 


[Comparative boxplot of Dimension (axis from 85 to 115) for the two machines]


68. Blood cocaine concentration (mg/L) was 


determined both for a sample of individuals 
who had died from cocaine-induced excited 
delirium (ED) and for a sample of those 
who had died from a cocaine overdose 
without excited delirium; survival time for 
people in both groups was at most 6 h. The 
accompanying data was read from a com- 
parative boxplot in the article “Fatal Exci- 
ted Delirium Following Cocaine Use” 
(J. Forensic Sci. 1997: 25-31). 


ED 0 0 0 0 1 l 
wl wl 2 2 3 3 
me! 4 f) 7 8 1.0 
1.5 27 28 35 40 89 
9.2 11.7 21.0 
Non-ED 0 0 0 0 0 1 


6 8 9 10 12 14 
15 #17 20 32 35 4.1 
43 48 50 56 59 6.0 
64 79 83 87 91 9.6 
99 11.0 115 12.2 12.7 14.0 

16.6 17.8 


a. Determine the medians, quartiles, and 
iqrs for the two samples. 

b. Are there any outliers in either sample? 
Any extreme outliers? 

c. Construct a comparative boxplot, and use 
it as a basis for comparing and contrast- 
ing the ED and non-ED samples. 
69. At the beginning of the 2007 baseball season each American League team had nine starting position players (this includes the designated hitter but not the pitcher). Here are the salaries for the New York Yankees and the Cleveland Indians in thousands of dollars:

Yankees: 12000 600 491 22709 21600 13000 13000 15000 23429

Indians: 3200 3750 396 383 1000 3750 917 3000 4050

Construct a comparative boxplot and comment on interesting features. Compare the salaries of the two teams. (The Indians won more games than the Yankees in the regular season and defeated the Yankees in the playoffs.)


70. The article “E-cigarettes as a Source of Toxic and Potentially Carcinogenic Metals” (Environ. Res. 2017: 221-225) reports the concentration (μg/L) of cadmium, chromium, lead, manganese, and nickel in 10 cartridges for each of five e-cigarette brands. Here are the lead levels in the 50 cartridges (some values were estimated from a graph in the article):

Brand
A    500    623    794    1228   1555   1705   2190   3162   3894   4870
B    3.53   3.67   3.98   10.2   16.4   20.6   34.0   49.1   126    218
C    7.94   14.3   23.8   44.2   59.3   79.3   156    204    219    233
D    3.17   3.45   4.21   4.56   4.95   5.01   5.23   5.34   5.68   5.89
E    4.50   4.89   4.99   5.02   5.06   5.24   6.43   7.09   8.52   9.82

Because the values are on very different scales, it makes sense to take the logarithms of these values first. Apply a log10() transformation to these values, construct a comparative boxplot, and comment on what you find.


71. The comparative boxplot below of gasoline vapor coefficients for vehicles in Detroit appeared in the article “Receptor Modeling Approach to [Volatile Organic Compound] Emission Inventory Validation” (J. Environ. Engr. 1995: 483-490). Discuss any interesting features.

[Comparative boxplot of gas vapor coefficient by time of day: 6 a.m., 8 a.m., 12 noon, 2 p.m., 10 p.m.]


72. Let $x_1, \ldots, x_n$ be a sample, and let a and b be constants. Define a new sample $y_1, \ldots, y_n$ by $y_1 = ax_1 + b, \ldots, y_n = ax_n + b$.

a. How do the sample variance and standard deviation of the $y_i$’s relate to the variance and standard deviation of the $x_i$’s? Verify your conjectures.

b. How does the iqr of the $y_i$’s relate to the iqr of the $x_i$’s? Substantiate your assertion.


Supplementary Exercises: (73-96)

73. The article “Correlation Analysis of Stenotic Aortic Valve Flow Patterns Using Phase Contrast MRI” (Annals of Biomed. Engr. 2005: 878-887) included the following data on aortic root diameter (cm) for a sample of patients having various degrees of aortic stenosis (i.e., narrowing of the aortic valve):

Males: 3.7 3.4 3.7 4.0 3.9 3.8 3.4 3.6 3.1 4.0 3.4 3.8 3.5

Females: 3.8 2.6 3.2 3.0 4.3 3.5 3.1 3.1 3.2 3.0
74.


75. 


76. 


6.3 
9.0 
10.7 
11.4 


77. 


a. Create a comparative stem-and-leaf plot. 

b. Calculate an appropriate measure of 
center for each set of observations. 

c. Compare and contrast the diameter 
observations for the two sexes. 


Consider the following information from a
sample of four Wolferman’s cranberry citrus
English muffins, which are said on the
package label to weigh 116 g: $\bar{x}$ = 104.4 g,
$s$ = 4.1497 g, smallest weighs 98.7 g, largest
weighs 108.0 g. Determine the values
of the two middle sample observations (and
don’t do it by successive guessing!).


Three different $C_2F_6$ flow rates (SCCM)
were considered in an experiment to
investigate the effect of flow rate on the
uniformity (%) of the etch on a silicon
wafer used in the manufacture of integrated
circuits, resulting in the following data:


Flow rate

125   2.6  2.7  3.0  3.2  3.8  4.6
160   3.6  4.2  4.2  4.6  4.9  5.0
200   2.9  3.4  3.5  4.1  4.6  5.1


Compare and contrast the uniformity 
observations resulting from these three 
different flow rates. 


The amount of radiation received at a 
greenhouse plays an important role in 
determining the rate of photosynthesis. The 
accompanying observations on incoming 
solar radiation were read from a graph in 
the article “Radiation Components over 
Bare and Planted Soils in a Greenhouse” 
(Solar Energy 1990: 1011-1016). 


6.4 7.7 8.4 8.5 8.8 8.9
9.1 10.0 10.1 10.2 10.6 10.6 
10.7 10.8 10.9 11.1 11.2 11.2 
11.9 11.9 12.2 13.1 


Use some of the methods discussed in this 
chapter to describe and summarize this data. 


The article “Motor Vehicle Emissions 
Variability” (J. Air Waste Manag. Assoc. 
1996: 667-675) reported the following 


78. 


52.9 
59.8 
47.3 
51.3 
53.9 
53.7 
66.4 
45.9 
55.0 
46.9 
81.6 
54.4 
49.7 
51.4 
58.3 
hydrocarbon and carbon dioxide measurements
using the Federal Testing Procedure
for emissions-testing, applied four times
each to the same car:


HC (g/mile):   13.8   18.3   32.2   32.5
CO (g/mile):   118    149    232    236


a. Compute the sample standard deviations 
for the HC and CO observations. Why 
should it not be surprising that the CO 
measurements have a larger standard 
deviation? 

b. The sample coefficient of variation $s/\bar{x}$
(or $100 \cdot s/\bar{x}$) assesses the extent of
variability relative to the mean. Values 
of this coefficient for several different 
data sets can be compared to determine 
which data sets exhibit more or less 
variation. Carry out such a comparison 
for the given data. 


The cost-to-charge ratio for a hospital is 
the ratio of the actual cost of care to what 
the hospital charges for that care. In 2008, 
the Kentucky Department of Health and 
Family Services reported the following 
cost-to-charge ratios, expressed as percents, 
for 116 Kentucky hospitals: 


49.7 58.1 41.4 66.5 44.1 53.0 49.1
47.1 44.3 52.3 60.5 59.9 47.1 62.4
62.1 52.1 47.8 65.1 42.9 38.5 65.9
52.6 44.9 47.8 60.2 56.4 67.6 31.9
50.6 72.5 47.8 50.5 25.1 45.0 86.0
61.2 63.4 51.5 48.6 42.1 49.3 50.0
64.6 47.4 48.1 45.8 64.7 58.7 56.9
82.9 46.0 51.0 67.0 49.3 69.5 56.5
39.2 85.0 46.7 41.6 45.4 71.2 42.7
39.2 55.3 46.1 43.2 67.7 60.6 68.2
39.2 54.7 63.5 67.9 50.9 40.4 49.0
39.2 43.2 43.2 51.7 48.4 50.7 59.4
60.2 40.2 62.3 41.4 48.6 45.6 46.2
65.3 31.5 50.6 41.4 82.3 45.2 46.0
46.3 38.2 59.1


(For example, a cost-to-charge ratio of 
53.0% means the actual cost of care is 53% 
of what the hospital charges.) Use various 
techniques discussed in this chapter to 
organize, summarize, and describe the data. 
80.


81. 


82 
108 
115 
136 


82. 


31 
Al 
AS 
54 


Fifteen air samples from a certain region 
were obtained, and for each one the carbon 
monoxide concentration was determined. 
The results (in ppm) were 


15.6 9.2 
12.1 9.8 


10.7 8.5 9.6 10.5


13.2 11.0 8.8


12.2 
13.7 


Using the interpolation method suggested 
in Section 1.3, compute the 10% trimmed 
mean. 


a. For what value of $c$ is the quantity
$\sum (x_i - c)^2$ minimized? [Hint: Take the
derivative with respect to $c$, set equal to
0, and solve.]

b. Using the result of part (a), which of
the two quantities $\sum (x_i - \bar{x})^2$ and
$\sum (x_i - \mu)^2$ will be smaller than the
other (assuming that $\bar{x} \neq \mu$)?


The article “A Longitudinal Study of the 
Development of Elementary School Chil- 
dren’s Private Speech” (Merrill-Palmer Q. 
1990: 443-463) reported on a study of 
children talking to themselves (private 
speech). It was thought that private speech 
would be related to IQ, because IQ is sup- 
posed to measure mental maturity, and it 
was known that private speech decreases 
as students progress through the primary 
grades. The study included 33 stu- 
dents whose first-grade IQ scores are given 
here: 


96 99 102 103 103 106 107 108 108
108 109 110 110 111 113 113 113 113
115 118 118 119 121 122 122 127 132
140 146 


Use various techniques discussed in this 
chapter to organize, summarize, and describe 
the data. 


The accompanying specific gravity values 
for various wood types used in construction 
appeared in the article “Bolted Connection 
Design Values Based on European Yield 
Model” (J. Struct. Engr. 1993: 2169-2186):


.35 .36 .36 .37 .38 .40 .40 .40
.41 .42 .42 .42 .42 .42 .43 .44
.46 .46 .47 .48 .48 .48 .51 .54
.55 .58 .62 .66 .66 .67 .68 .75
Construct a stem-and-leaf display using 
repeated stems, and comment on any 
interesting features of the display. 


In recent years, some evidence suggests 
that high indoor radon concentration may 
be linked to the development of childhood 
cancers, but many health professionals 
remain unconvinced. The article “Indoor 
Radon and Childhood Cancer” (Lancet 
1991: 1537-1538) presented the accompanying
data on radon concentration (Bq/m$^3$)
in two different samples of houses. The first 
sample consisted of houses in which a child 
diagnosed with cancer had been residing. 
Houses in the second sample had no 
recorded cases of childhood cancer. 


Cancer: 3 5 6 7 8 9 9 10 10 10
11 11 11 11 12 13 13 15 15 15
16 16 16 17 18 18 18 20 21 21
22 22 23 23 27 33 34 38 39 45


No cancer: 3 3 5 6 6 7 7 7 8 8
9 9 9 9 11 11 11 11 11 12
12 13 14 17 17 21 21 24 24 29
29 29 29 33 38 39 55 55 85


a. Construct a side-by-side stem-and-leaf 
display, and comment on any interesting 
features. 

b. Calculate the standard deviation of each 
sample. Which sample appears to have 
greater variability, according to these 
values? 

c. Calculate the iqr for each sample. Now 
which sample has greater variability, and 
why is this different than the result of part 
(b)? 


84. Elevated energy consumption during exercise
continues after the workout ends.
Because calories burned after exercise contribute
to weight loss and have other consequences,
it is important to understand this
process. The paper “Effect of Weight
Training Exercise and Treadmill Exercise on
Post-Exercise Oxygen Consumption” (Med.
Sci. Sports Exercise 1998: 518-522) reported
the accompanying data from a study in
which oxygen consumption (liters) was
measured continuously for 30 min for each
of 15 subjects both after a weight training
exercise and after a treadmill exercise.




a. Construct side-by-side boxplots of the 
weight and treadmill observations, and 
comment on what you see. 

b. Because the data is in the form of
$(x, y)$ pairs, with $x$ and $y$ measurements
on the same variable under two different
conditions, it is natural to focus on the
differences within pairs: $d_1 = x_1 - y_1$,
$\ldots$, $d_n = x_n - y_n$. Construct a boxplot of
the sample differences. What does it
suggest?


Subject 1 2 3 4 5) 6 


Weight (x) 
Treadmill (y) 


24.3 
15.2 


16.3 
10.1 


22.1 
19.6 


Subject 7 8 9 10 11 12 


Weight (x) 
Treadmill (y) 


17.0 
10.3 2.6 


19.1 
16.6 


19.6 
22.4 


Subject 13 14 15 


Weight (x) 
Treadmill (y) 


85. 


86. 


23.2 
23.6 


18.5 15.9 
12.6 4.4 


Anxiety disorders and symptoms can often 
be effectively treated with benzodiazepine 
medications. It is known that animals 
exposed to stress exhibit a decrease in 
benzodiazepine receptor binding in the 
frontal cortex. The paper “Decreased Ben- 
zodiazepine Receptor Binding in Prefrontal 
Cortex in Combat-Related Posttraumatic 
Stress Disorder” (Amer. J. Psychiatry 2000: 
1120-1126) described the first study of 
benzodiazepine receptor binding in indi- 
viduals suffering from PTSD. The accom- 
panying data on a receptor binding measure 
(adjusted distribution volume) was read 
from a graph in the paper. 


PTSD: 10 20 25 28 31 35 37 38 38 39 39 42 46 
Healthy: 23 39 40 41 43 47 51 58 63 66 67 69 72 


Use various methods from this chapter to 
describe and summarize the data. 


The article “Can We Really Walk 
Straight?” (Amer. J. Phys. Anthropol. 1992: 
19-27) reported on an experiment in which 
each of 20 healthy men was asked to walk 
as straight as possible to a target 60 m away 
at normal speed. Consider the following 


95 
78 
observations on cadence (number of strides
per second): 


.85 .92 .95 .93 .86
.93 .93 1.05 .93 1.06


1.00 .92 .85 .81
1.06 .96 .81 .96


Use the methods developed in this chapter to 
summarize the data; include an interpreta- 
tion or discussion wherever appropriate. 
[Note: The author of the article used a rather 
sophisticated statistical analysis to conclude 
that people cannot walk in a straight line and 
suggested several explanations for this.] 


87. The mode of a numerical data set is the 


value that occurs most frequently in the set. 


a. Determine the mode for the cadence 
data given in the previous exercise. 

b. For a categorical sample, how would 
you define the modal category? 


88. Specimens of three different types of rope 


wire were selected, and the fatigue limit 
(MPa) was determined for each specimen, 
resulting in the accompanying data. 


Type 1: 350 350 350 358 370 370 370 371 371 372 372 384 391 391 392

Type 2: 350 354 359 363 365 368 369 371 373 374 376 380 383 388 392

Type 3: 350 361 362 364 364 365 366 371 377 377 377 379 380 380 392


a. Construct a comparative boxplot, and 
comment on similarities and differences. 

b. Construct a comparative dotplot (a 
dotplot for each sample with a common 
scale). Comment on similarities and 
differences. 

c. Does the comparative boxplot of part 
(a) give an informative assessment of 
similarities and differences? Explain 
your reasoning. 


89. The three measures of center introduced in 


this chapter are the mean, median, and 
trimmed mean. Two additional measures of 
center that are occasionally used are the 
midrange, which is the average of the 
smallest and largest observations, and the 
0.41
0.58 
1.02 
1.17 
1.68 
2.49 
4.75 


midquarter, which is the average of the two 
quartiles. Which of these five measures of 
center are resistant to the effects of outliers 
and which are not? Explain your reasoning. 


The authors of the article “Predictive Model 
for Pitting Corrosion in Buried Oil and Gas 
Pipelines” (Corrosion 2009: 332-342) 
provided the data on which their investi- 
gation was based. 


a. Consider the following sample of 61 
observations on maximum pitting depth 
(mm) of pipeline specimens buried in 
clay loam soil. 


0.41 
0.79 
1.04 
1.19 
1.91 
2.57
5.33


0.41 
0.79 
1.04 
1.19 
1.96 
2.74 
7.65 


0.41 
0.81 
1.17 
1.27 
1.96 
3.10 
7.70 


0.43 
0.81 
1.17 
1.40 
1.96 
3.18 
8.13 


0.43 
0.81 
1.17 
1.40 
2.10 
3.30 
10.41 


0.43 
0.91 
1.17 
1.59 
2.21 
3.58 
13.44 


0.48 
0.94 
1.17 
1.59 
2.31 
3.58 


0.48 
0.94 
1.17 
1.60 
2.46 
4.15 


Construct a stem-and-leaf display in which 
the two largest values are shown in a last 
row labeled HI. 


b. Refer back to (a), and create a histogram
based on eight classes with 0 as
the lower limit of the first class and
class widths of .5, .5, .5, .5, 1, 2, 5, and
5, respectively.

c. The accompanying comparative boxplot 
shows plots of pitting depth for four 
different types of soils. Describe its 
important features. 


[Figure: comparative boxplot of maximum pit depth by soil type]
91. Consider a sample $x_1, x_2, \ldots, x_n$ and suppose
that the values of $\bar{x}$, $s^2$, and $s$ have
been calculated.


a. Let $y_i = x_i - \bar{x}$ for $i = 1, \ldots, n$. How do
the values of $s^2$ and $s$ for the $y_i$'s compare
to the corresponding values for the
$x_i$'s? Explain.

b. Let $z_i = (x_i - \bar{x})/s$ for $i = 1, \ldots, n$. What are
the values of the sample variance and
sample standard deviation for the $z_i$'s?


92. Let $\bar{x}_n$ and $s_n^2$ denote the sample mean and
variance for the sample $x_1, \ldots, x_n$, and let $\bar{x}_{n+1}$
and $s_{n+1}^2$ denote these quantities when an
additional observation $x_{n+1}$ is added to the
sample.


a. Show how $\bar{x}_{n+1}$ can be computed from
$\bar{x}_n$ and $x_{n+1}$.

b. Show that

$$n s_{n+1}^2 = (n - 1) s_n^2 + \frac{n}{n + 1} (x_{n+1} - \bar{x}_n)^2$$

so that $s_{n+1}^2$ can be computed from
$x_{n+1}$, $\bar{x}_n$, and $s_n^2$.

c. Suppose that a sample of 15 strands of
drapery yarn has resulted in a sample
mean thread elongation of 12.58 mm
and a sample standard deviation of
.512 mm. A 16th strand results in an
elongation value of 11.8. What are the
values of the sample mean and sample
standard deviation for all 16 elongation
observations?


93. Lengths of bus routes for any particular 
transit system will typically vary from one 
route to another. The article “Planning of 
City Bus Routes” (J. Institut. Engr. 1995: 
211-215) gives the following information 
on lengths (km) for one particular system: 


Length: 6-8 8-10 10-12 12-14 14-16 
Frequency: 6 23 30 35 32 
Length: 16-18 18-20 20-22 22-24 24-26 
Frequency: 48 42 40 28 27 
Length: 26-28 28-30 30-35 35-40 40-45 
Frequency: 26 14 27 11 2 
a. Draw a histogram corresponding to
these frequencies. 

b. What proportion of these route lengths 
are less than 20? What proportion of 
these routes have lengths of at least 30? 

c. Roughly what is the value of the 90th 
percentile of the route length distribution? 

d. Roughly what is the median route 
length? 


94. A study carried out to investigate the distribution
of total braking time (reaction
time plus accelerator-to-brake movement
time, in msec) during real driving conditions
at 60 km/h gave the following summary
information on the distribution of
times (“A Field Study on Braking
Responses during Driving,” Ergonomics
1995: 1903-1910):


mean = 535 median = 500 mode = 500 
sd = 96 minimum = 220 maximum = 925 
5th percentile = 400 10th percentile = 430 
90th percentile = 640 95th percentile = 720 


What can you conclude about the shape of 
a histogram of this data? Explain your 
reasoning. 


95. The sample data $x_1, x_2, \ldots, x_n$ sometimes
represents a time series, where $x_t$ = the
observed value of a response variable $x$ at
time $t$. Often the observed series shows a
great deal of random variation, which makes
it difficult to study longer-term behavior. In
such situations, it is desirable to produce a
smoothed version of the series. One technique
for doing so involves exponential
smoothing. The value of a smoothing constant
$\alpha$ is chosen ($0 < \alpha < 1$). Then with $\bar{x}_t$
defined as the smoothed value at time $t$, we
set $\bar{x}_1 = x_1$, and for $t = 2, 3, \ldots, n$,
$\bar{x}_t = \alpha x_t + (1 - \alpha)\bar{x}_{t-1}$.


a. Consider the following time series in
which $x_t$ = temperature (°F) of effluent
at a sewage treatment plant on day $t$: 47,
54, 53, 50, 46, 46, 47, 50, 51, 50, 46,
52, 50, 50. Plot each $x_t$ against $t$ on a
two-dimensional coordinate system (a
time series plot). Does there appear to
be any pattern?

b. Calculate the $\bar{x}_t$'s using $\alpha$ = .1. Repeat
using $\alpha$ = .5. Which value of $\alpha$ gives a
smoother $\bar{x}_t$ series?

c. Substitute $\bar{x}_{t-1} = \alpha x_{t-1} + (1 - \alpha)\bar{x}_{t-2}$
on the right-hand side of the expression
for $\bar{x}_t$, then substitute $\bar{x}_{t-2}$ in terms of
$x_{t-2}$ and $\bar{x}_{t-3}$, and so on. On how many
of the values $x_t, x_{t-1}, \ldots, x_1$ does $\bar{x}_t$
depend? What happens to the coefficient
on $x_{t-k}$ as $k$ increases?

d. Refer to part (c). If $t$ is large, how sensitive
is $\bar{x}_t$ to the initialization $\bar{x}_1 = x_1$?
Explain.
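
The smoothing recursion defined in this exercise can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `exponential_smoothing` and the three-value series are hypothetical and are not taken from the exercise data.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed values for a smoothing constant 0 < alpha < 1.

    The first smoothed value equals the first observation; each later value
    is alpha * (current observation) + (1 - alpha) * (previous smoothed value).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not series:
        return []
    smoothed = [series[0]]  # initialization: smoothed value at time 1 is x_1
    for x_t in series[1:]:
        smoothed.append(alpha * x_t + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical three-point series, chosen only to make the arithmetic easy to check.
toy_series = [10.0, 20.0, 30.0]
print(exponential_smoothing(toy_series, 0.5))  # [10.0, 15.0, 22.5]
```

Calling the function twice with different values of `alpha` on the same series is one way to compare how strongly each choice damps the fluctuations.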


96. Consider numerical observations $x_1, \ldots, x_n$.
It is frequently of interest to know whether
the $x_i$'s are (at least approximately) symmetrically
distributed about some value. If
$n$ is at least moderately large, the extent of
symmetry can be assessed from a stem-and-leaf
display or histogram. However, if $n$ is
not very large, such pictures are not particularly
informative. Consider the following
alternative. Let $y_1$ denote the smallest
$x_i$, $y_2$ the second-smallest $x_i$, and so on.
Then plot the following pairs as points on a
two-dimensional coordinate system, where $\tilde{x}$
denotes the sample median: $(y_n - \tilde{x}, \tilde{x} - y_1)$,
$(y_{n-1} - \tilde{x}, \tilde{x} - y_2)$, $(y_{n-2} - \tilde{x}, \tilde{x} - y_3)$,
.... There are $n/2$ points when $n$ is even and
$(n - 1)/2$ when $n$ is odd.


a. What does this plot look like when there 
is perfect symmetry in the data? What 
does it look like when observations 
stretch out more above the median than 
below it (a long upper tail)? 

b. Construct the plot for the nitrogen data 
presented in Example 1.17, and com- 
ment on the extent of symmetry or nat- 
ure of departure from symmetry. 



Introduction 

The term probability refers to the study of randomness and uncertainty. In any situation in which one 
of a number of possible outcomes may occur, the theory of probability provides methods for 
quantifying the chances, or likelihoods, associated with the various outcomes. The language of 
probability is constantly used in an informal manner in both written and spoken contexts. Examples 
include such statements as “It is likely that the Dow Jones Industrial Average will increase by the end 
of the year,” “There is a 50-50 chance that the incumbent will seek reelection,” “There will probably 
be at least one section of that course offered next year,” “The odds favor a quick settlement of the 
strike,” and “It is expected that at least 20,000 concert tickets will be sold.” In this chapter, we 
introduce some elementary probability concepts, indicate how probabilities can be interpreted, and 
show how the rules of probability can be applied to compute the probabilities of many interesting 
events. The methodology of probability will then permit us to express in precise language such 
informal statements as those given above. 

The study of probability as a branch of mathematics goes back over 300 years, where it had its 
genesis in connection with questions involving games of chance. Many books are devoted exclusively 
to probability and explore in great detail numerous interesting aspects and applications of this lovely 
branch of mathematics. Our objective here is more limited in scope: We will focus on those topics 
that are central to a basic understanding and also have the most direct bearing on problems of 
statistical inference. 


2.1 Sample Spaces and Events 


In probability, an experiment refers to any action or activity whose outcome is subject to uncertainty. 
Although the word experiment generally suggests a planned or carefully controlled laboratory testing 
situation, we use it here in a much wider sense. Thus experiments that may be of interest include 
tossing a coin once or several times, selecting a card or cards from a deck, weighing a loaf of bread, 
measuring the commute time from home to work on a particular morning, determining blood types 
from a group of individuals, or calling people to conduct a survey. 
The Sample Space of an Experiment


DEFINITION The sample space of an experiment, denoted by $\mathcal{S}$, is the set of all possible
outcomes of that experiment.


Example 2.1 The simplest experiment to which probability applies is one with two possible outcomes.
One such experiment consists of examining a single fuse to see whether it is defective. The
sample space for this experiment can be abbreviated as $\mathcal{S}$ = {N, D}, where N represents not defective,
D represents defective, and the braces are used to enclose the elements of a set. Another such
experiment would involve tossing a thumbtack and noting whether it landed point up or point down,
with sample space $\mathcal{S}$ = {U, D}, and yet another would consist of observing the sex assigned to the next
child born at the local hospital, with $\mathcal{S}$ = {M, F}. ■


Example 2.2 If we examine three fuses in sequence and note the result of each examination, then an
outcome for the entire experiment is any sequence of N’s and D’s of length 3, so


$\mathcal{S}$ = {NNN, NND, NDN, NDD, DNN, DND, DDN, DDD}


If we had tossed a thumbtack three times, the sample space would be obtained by replacing N by U in
$\mathcal{S}$ above. A similar notational change would yield the sample space for the experiment in which the
assigned sexes of three newborn children are observed. ■
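
A sample space made of fixed-length sequences, like the one in Example 2.2, can be listed mechanically. The sketch below uses Python's `itertools.product`; the helper name `sequence_sample_space` is an assumption introduced here for illustration.

```python
from itertools import product


def sequence_sample_space(symbols, length):
    """List every sequence of the given length built from the given symbols."""
    return ["".join(seq) for seq in product(symbols, repeat=length)]


# Three fuses, each either not defective (N) or defective (D).
fuses = sequence_sample_space("ND", 3)
print(fuses)       # ['NNN', 'NND', 'NDN', 'NDD', 'DNN', 'DND', 'DDN', 'DDD']
print(len(fuses))  # 8

# The thumbtack version of the experiment only changes the symbols.
tacks = sequence_sample_space("UD", 3)
```

With two possible results per trial and three trials, the listing has 2 × 2 × 2 = 8 outcomes, which matches the set displayed in the example.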


Example 2.3 Two gas stations are located at a certain intersection. Each one has six gas pumps. 
Consider the experiment in which the number of pumps in use at a particular time of day is observed 
for each of the stations. An experimental outcome specifies how many pumps are in use at the first 
station and how many are in use at the second one. One possible outcome is (2, 2), another is (4, 1), 
and yet another is (1, 4). The 49 outcomes in S are displayed in the accompanying table. 


                           Second station

First station    0        1        2        3        4        5        6

0              (0, 0)   (0, 1)   (0, 2)   (0, 3)   (0, 4)   (0, 5)   (0, 6)
1              (1, 0)   (1, 1)   (1, 2)   (1, 3)   (1, 4)   (1, 5)   (1, 6)
2              (2, 0)   (2, 1)   (2, 2)   (2, 3)   (2, 4)   (2, 5)   (2, 6)
3              (3, 0)   (3, 1)   (3, 2)   (3, 3)   (3, 4)   (3, 5)   (3, 6)
4              (4, 0)   (4, 1)   (4, 2)   (4, 3)   (4, 4)   (4, 5)   (4, 6)
5              (5, 0)   (5, 1)   (5, 2)   (5, 3)   (5, 4)   (5, 5)   (5, 6)
6              (6, 0)   (6, 1)   (6, 2)   (6, 3)   (6, 4)   (6, 5)   (6, 6)


The sample space for the experiment in which a six-sided die is thrown twice results from deleting
the 0 row and 0 column from the table, giving 36 outcomes. ■
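
The table for Example 2.3 can also be generated rather than typed out. The following is a minimal sketch; the variable names are illustrative choices, not notation from the text.

```python
from itertools import product

# Each station can have 0 through 6 pumps in use.
pump_counts = range(7)

# An outcome is a pair (pumps in use at first station, pumps in use at second station).
gas_station_space = list(product(pump_counts, repeat=2))
print(len(gas_station_space))  # 49

# Deleting the 0 row and 0 column leaves the two-dice sample space.
two_dice_space = [(i, j) for (i, j) in gas_station_space if i != 0 and j != 0]
print(len(two_dice_space))     # 36
```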


Example 2.4 If a new cell phone battery has a voltage that is outside certain limits, that battery is
characterized as a failure (F); if the battery has a voltage within the prescribed limits, it is a success
(S). Suppose an experiment consists of testing each battery as it comes off an assembly line until we
first observe a success. Although it may not be very likely, a possible outcome of this experiment is
that the first 10 (or 100 or 1000 or ...) are F’s and the next one is an S. That is, for any positive integer
n, we may have to examine n batteries before seeing the first S. The sample space is
$\mathcal{S}$ = {S, FS, FFS, FFFS, ...}, which contains an infinite number of possible outcomes. The same
abbreviated form of the sample space is appropriate for an experiment in which, starting at a specified
time, the sex of each newborn infant at a hospital is recorded until the birth of a female is observed. ■
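
An infinite sample space such as the one in Example 2.4 cannot be stored as a list, but its outcomes can be produced one at a time. The sketch below uses a Python generator; the name `battery_outcomes` is hypothetical.

```python
from itertools import count, islice


def battery_outcomes():
    """Yield S, FS, FFS, ... : some number of failures followed by the first success."""
    for failures in count(0):
        yield "F" * failures + "S"


# Only a finite prefix of the sample space can ever be displayed.
print(list(islice(battery_outcomes(), 4)))  # ['S', 'FS', 'FFS', 'FFFS']
```

The generator never terminates on its own, which mirrors the fact that there is no largest number of batteries that might have to be examined.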


Events 
In our study of probability, we will be interested not only in the individual outcomes but also in
various collections of outcomes from $\mathcal{S}$.


DEFINITION An event is any collection (subset) of outcomes contained in the sample space $\mathcal{S}$.
An event is said to be simple if it consists of exactly one outcome and compound
if it consists of more than one outcome.


When an experiment is performed, a particular event A is said to occur if the resulting experimental 
outcome is contained in A. In general, exactly one simple event will occur, but many compound 
events will occur simultaneously. 


Example 2.5 Consider an experiment in which each of three vehicles taking a particular freeway
exit turns left (L) or right (R) at the end of the off-ramp. The eight possible outcomes that comprise the
sample space are LLL, RLL, LRL, LLR, LRR, RLR, RRL, and RRR. Thus there are eight simple events,
among which are $E_1$ = {LLL} and $E_5$ = {LRR}. Some compound events include


A = {RLL, LRL, LLR} = the event that exactly one of the three vehicles turns right
B = {LLL, RLL, LRL, LLR} = the event that at most one of the vehicles turns right
C = {LLL, RRR} = the event that all three vehicles turn in the same direction.


Suppose that when the experiment is performed, the outcome is LLL. Then the simple event $E_1$ has
occurred and so also have the events B and C, but not A. ■
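
Because an event is a subset of the sample space, "the event occurred" is simply a membership test. The sketch below rebuilds the events of Example 2.5 from their verbal descriptions; the variable names are illustrative.

```python
from itertools import product

# All sequences of three turns, each left (L) or right (R).
sample_space = {"".join(turns) for turns in product("LR", repeat=3)}

A = {o for o in sample_space if o.count("R") == 1}   # exactly one vehicle turns right
B = {o for o in sample_space if o.count("R") <= 1}   # at most one vehicle turns right
C = {o for o in sample_space if len(set(o)) == 1}    # all three turn the same way

# An event occurs when the observed outcome belongs to it.
outcome = "LLL"
events = {"A": A, "B": B, "C": C}
occurred = {name: outcome in event for name, event in events.items()}
print(occurred)  # {'A': False, 'B': True, 'C': True}
```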


Example 2.6 (Example 2.3 continued) When the number of pumps in use at each of two six-pump
gas stations is observed, there are 49 possible outcomes, so there are 49 simple events: $E_1$ = {(0, 0)},
$E_2$ = {(0, 1)}, ..., $E_{49}$ = {(6, 6)}. Examples of compound events are


A = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)} = the event that the number of pumps in use is
the same for both stations

B = {(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)} = the event that the total number of pumps in use is four

C = {(0, 0), (0, 1), (1, 0), (1, 1)} = the event that at most one pump is in use at each station. ■


Example 2.7 (Example 2.4 continued) The sample space for the cell phone battery experiment 
contains an infinite number of outcomes, so there are an infinite number of simple events. Compound 
events include 


A = {S, FS, FFS} = the event that at most three batteries are examined
B = {S, FFS, FFFFS} = the event that exactly one, three, or five batteries are examined
C = {FS, FFFS, FFFFFS, ...} = the event that an even number of batteries are examined. ■


Some Relations from Set Theory 
An event is nothing but a set, so relationships and results from elementary set theory can be used to 
study events. The following operations will be used to construct new events from given events. 
DEFINITION 1. The complement of an event A, denoted by $A'$, is the set of all outcomes in $\mathcal{S}$ that
are not contained in A.

2. The intersection of two events A and B, denoted by $A \cap B$ and read “A and B,” is
the event consisting of all outcomes that are in both A and B.

3. The union of two events A and B, denoted by $A \cup B$ and read “A or B,” is the
event consisting of all outcomes that are either in A or in B or in both events (so
that the union includes outcomes for which both A and B occur as well as
outcomes for which exactly one occurs); that is, all outcomes in at least one of
the events.


Example 2.8 (Example 2.3 continued) For the experiment in which the number of pumps in use at a
single six-pump gas station is observed, let A = {0, 1, 2, 3, 4}, B = {3, 4, 5, 6}, and C = {1, 3, 5}.
Then

$A \cup B$ = {0, 1, 2, 3, 4, 5, 6} = $\mathcal{S}$     $A \cup C$ = {0, 1, 2, 3, 4, 5}

$A \cap B$ = {3, 4}     $A \cap C$ = {1, 3}     $A'$ = {5, 6}     $(A \cup C)'$ = {6} ■
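
The computations in Example 2.8 map directly onto Python's built-in set operators, which gives a quick way to check work of this kind. In this sketch the complement is written as a set difference from the sample space, since Python sets have no complement operator of their own.

```python
# Sample space: number of pumps in use at a single six-pump station.
S = set(range(7))
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {1, 3, 5}

print(sorted(A | B))        # union:        [0, 1, 2, 3, 4, 5, 6]
print(sorted(A | C))        # union:        [0, 1, 2, 3, 4, 5]
print(sorted(A & B))        # intersection: [3, 4]
print(sorted(A & C))        # intersection: [1, 3]
print(sorted(S - A))        # complement of A:         [5, 6]
print(sorted(S - (A | C)))  # complement of A union C: [6]
```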


Example 2.9 (Example 2.4 continued) In the cell phone battery experiment, define A, B, and C as in 
Example 2.7. Then 


$A \cup B$ = {S, FS, FFS, FFFFS}
$A \cap B$ = {S, FFS}
$A'$ = {FFFS, FFFFS, FFFFFS, ...}


and


$C'$ = {S, FFS, FFFFS, ...} = the event that an odd number of batteries are examined. ■


The complement, intersection, and union operators from set theory correspond to the not, and, and or
operators from computer science. Readers with prior programming experience may be aware of an
important relationship between these three operators, first discovered by 19th-century British
mathematician Augustus De Morgan.


DE MORGAN’S LAWS Let A and B be two events in the sample space of some experiment.
Then


1. $(A \cup B)' = A' \cap B'$


2. $(A \cap B)' = A' \cup B'$


De Morgan’s laws state that the complement of a union is an intersection, and the complement of an 
intersection is a union (see Exercise 11). 
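
De Morgan's laws can be checked by brute force on a small sample space. The sketch below tries every pair of events in a seven-outcome sample space; this is a numerical check for illustration, not a proof, and the helper names `all_events` and `de_morgan_holds` are assumptions introduced here.

```python
from itertools import combinations


def all_events(sample_space):
    """Return every subset (event) of a finite sample space."""
    items = sorted(sample_space)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def de_morgan_holds(sample_space):
    """Check both of De Morgan's laws for every pair of events A and B."""
    events = all_events(sample_space)
    for A in events:
        for B in events:
            comp_A, comp_B = sample_space - A, sample_space - B
            if sample_space - (A | B) != comp_A & comp_B:
                return False
            if sample_space - (A & B) != comp_A | comp_B:
                return False
    return True


print(de_morgan_holds(set(range(7))))  # True
```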

Sometimes A and B have no outcomes in common, so that the intersection of A and B contains no 
outcomes. 
DEFINITION When A and B have no outcomes in common, they are said to be disjoint or
mutually exclusive events. Mathematicians write this compactly as $A \cap B = \varnothing$,
where $\varnothing$ denotes the event consisting of no outcomes whatsoever (the “null” or
“empty” event).


Example 2.10 A small city has three automobile dealerships: a GM dealer selling Chevrolets and 
Buicks; a Ford dealer selling Fords and Lincolns; and a Chrysler dealer selling Rams and Jeeps. If an 
experiment consists of observing the brand of the next vehicle sold, then the events A = {Chevrolet, 
Buick} and B = {Ford, Lincoln} are mutually exclusive, because the next vehicle sold cannot be both 
a GM product and a Ford product. ■


Venn diagrams are often used to visually represent sample spaces and events. To construct a
Venn diagram, draw a rectangle whose interior will represent the sample space $\mathcal{S}$. Then any event A is
represented as the interior of a closed curve (often a circle) contained in $\mathcal{S}$. Figure 2.1 shows examples
of Venn diagrams.


Figure 2.1 Venn diagrams: (a) Venn diagram of events A and B; (b) shaded region is $A \cap B$; (c) shaded region is $A \cup B$; (d) shaded region is $A'$; (e) mutually exclusive events


The operations of union and intersection can be extended to more than two events. For any three
events A, B, and C, the event $A \cap B \cap C$ is the set of outcomes contained in all three events, whereas
$A \cup B \cup C$ is the set of outcomes contained in at least one of the three events. A collection of several
events is said to be mutually exclusive (or pairwise disjoint) if no two events have any outcomes in
common. De Morgan’s laws also extend; e.g., $(A \cup B \cup C)' = A' \cap B' \cap C'$.
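
Checking whether a collection of events is mutually exclusive amounts to testing every pair for disjointness. The sketch below reuses the dealership brands from Example 2.10; the function name `mutually_exclusive` and the fourth event `mixed` are hypothetical additions for illustration.

```python
from itertools import combinations


def mutually_exclusive(events):
    """Return True when no two events in the collection share an outcome."""
    return all(a.isdisjoint(b) for a, b in combinations(events, 2))


gm = {"Chevrolet", "Buick"}
ford = {"Ford", "Lincoln"}
chrysler = {"Ram", "Jeep"}

print(mutually_exclusive([gm, ford, chrysler]))  # True

# A made-up event that overlaps two of the dealers breaks pairwise disjointness.
mixed = {"Buick", "Jeep"}
print(mutually_exclusive([gm, ford, mixed]))     # False
```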


Exercises Section 2.1 (1-12) 


1. Ann and Bev have each applied for several
jobs at a local university. Let A be the event
that Ann is hired, and let B be the event that
Bev is hired. Express in terms of A and
B the following events:

a. Ann is hired but not Bev.
b. At least one of them is hired.
c. Exactly one of them is hired.


2. Two voters, Al and Bill, are each choosing
between one of three candidates (1, 2, and
3) who are running for city council. An
experimental outcome specifies both Al’s
choice and Bill’s choice, e.g., the pair (3, 2).

a. List all elements of $\mathcal{S}$.
b. List all outcomes in the event A that Al
and Bill make the same choice.
c. List all outcomes in the event B that
neither of them votes for candidate 2.


3. Four universities (1, 2, 3, and 4) are participating
in a holiday basketball tournament.
In the first round, 1 will play 2 and 3 will
play 4. Then the two winners will play for the
championship, and the two losers will also
play. One possible outcome can be denoted
by 1324: 1 beats 2 and 3 beats 4 in first-round
games, and then 1 beats 3 and 2 beats 4.
a. List all outcomes in $\mathcal{S}$.

b. Let A denote the event that 1 wins the
tournament. List outcomes in A.

c. Let B denote the event that 2 gets into
the championship game. List outcomes
in B.

d. What are the outcomes in $A \cup B$ and in
$A \cap B$? What are the outcomes in $A'$?


4. Suppose that vehicles taking a particular
freeway exit can turn right (R), turn left (L),
or go straight (S). Consider observing the
direction for each of three successive
vehicles.


a. List all outcomes in the event A that all 
three vehicles go in the same direction. 

b. List all outcomes in the event B that all 
three vehicles take different directions. 

c. List all outcomes in the event C that 
exactly two of the three vehicles turn 
right. 

d. List all outcomes in the event D that 
exactly two vehicles go in the same 
direction. 

e. List the outcomes in $D'$, $C \cup D$, and
$C \cap D$.


5. Three components are connected to form a
system as shown in the accompanying
diagram. Because the components in the
2-3 subsystem are connected in parallel,
that subsystem will function if at least one
of the two individual components functions.
For the entire system to function,
component 1 must function and so must the
2-3 subsystem.


The experiment consists of determining the 
condition of each component: S (success) for 
a functioning component and F (failure) for a 
nonfunctioning component. 




a. What outcomes are contained in the 
event A that exactly two out of the three 
components function? 

b. What outcomes are contained in the 
event B that at least two of the compo- 
nents function? 

c. What outcomes are contained in the 
event C that the system functions? 

d. List outcomes in C′, A ∪ C, A ∩ C,
B ∪ C, and B ∩ C.


Each of a sample of four home mortgages
is classified as fixed rate (F) or variable
rate (V).


a. What are the 16 outcomes in S? 

b. Which outcomes are in the event that 
exactly three of the selected mortgages 
are fixed rate? 

c. Which outcomes are in the event that all 
four mortgages are of the same type? 

d. Which outcomes are in the event that at 
most one of the four is a variable rate 
mortgage? 

e. What is the union of the events in parts 
(c) and (d), and what is the intersection 
of these two events? 

f. What are the union and intersection of 
the two events in parts (b) and (c)? 


7. A family consisting of three persons (A, B,
and C) belongs to a medical clinic that
always has a doctor at each of stations 1, 2, 
and 3. During a certain week, each member 
of the family visits the clinic once and is 
assigned at random to a station. The 
experiment consists of recording the station 
number for each member. One outcome is 
(1, 2, 1) for A to station 1, B to station 2, 
and C to station 1. 


a. List the 27 outcomes in the sample 
space. 

b. List all outcomes in the event that all 
three members go to the same station. 

c. List all outcomes in the event that all 
members go to different stations. 

d. List all outcomes in the event that no 
one goes to station 2. 


8. A college library has five copies of a certain
text on reserve. Copies 1 and 2 are first
printings, whereas 3, 4, and 5 are second
printings. A student examines these books
in random order, stopping only when a
second printing has been selected. One
possible outcome is 5, and another is 213.


a. List the outcomes in S.

b. Let A denote the event that exactly one 
book must be examined. What out- 
comes are in A? 

c. Let B be the event that book 5 is the one 
selected. What outcomes are in B? 

d. Let C be the event that book 1 is not 
examined. What outcomes are in C? 


9. An academic department has just completed
voting by secret ballot for a department
head. The ballot box contains four
slips with votes for candidate A and three 
slips with votes for candidate B. Suppose 
these slips are removed from the box one 
by one. 


a. List all possible outcomes. 

b. Suppose a running tally is kept as slips 
are removed. For what outcomes does 
A remain ahead of B throughout the 
tally? 

10. A construction firm is currently working on
three different buildings. Let Aᵢ denote the
event that the ith building is completed by
the contract date. Use the operations of
union, intersection, and complementation to
describe each of the following events in
terms of A₁, A₂, and A₃, draw a Venn diagram,
and shade the region corresponding
to each one.

a. At least one building is completed by
the contract date.

b. All buildings are completed by the
contract date.

c. Only the first building is completed by
the contract date.

d. Exactly one building is completed by
the contract date.

e. Either the first building or both of the
other two buildings are completed by
the contract date.


11. Use Venn diagrams to verify De Morgan’s
laws:

a. (A ∪ B)′ = A′ ∩ B′

b. (A ∩ B)′ = A′ ∪ B′


12. a. In Example 2.10, identify three events
that are mutually exclusive.

b. Suppose there is no outcome common
to all three of the events A, B, and
C. Are these three events necessarily
mutually exclusive? If your answer is
yes, explain why; if your answer is no,
give a counterexample using the experiment
of Example 2.10.


2.2 Axioms, Interpretations, and Properties of Probability 


Given an experiment and its sample space S, the objective of probability is to assign to each event A a
number P(A), called the probability of the event A, which will give a precise measure of the chance 
that A will occur. To ensure that the probability assignments will be consistent with our intuitive notions 
of probability, all assignments must satisfy the following axioms (basic properties) of probability. 


AXIOM 1   For any event A, P(A) ≥ 0.

AXIOM 2   P(S) = 1.

AXIOM 3   If A₁, A₂, A₃, ... is an infinite collection of disjoint events, then

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i)$$

Axiom 1 reflects the intuitive notion that the chance of A occurring should be nonnegative. The
sample space is by definition the event that must occur when the experiment is performed (S contains
all possible outcomes), so Axiom 2 says that the maximum possible probability of 1 is assigned to S.
The third axiom formalizes the idea that if we wish the probability that at least one of a number of 
events will occur, and no two of the events can occur simultaneously, then the chance of at least one 
occurring is the sum of the chances of the individual events. 

You might wonder why the third axiom contains no reference to a finite collection of disjoint 
events. It is because the corresponding property for a finite collection can be derived from our three 
axioms. We want our axiom list to be as short as possible and not contain any property that can be 
derived from others on the list. 


PROPOSITION P(∅) = 0, where ∅ is the null event. This, in turn, implies that the property
contained in Axiom 3 is valid for a finite collection of disjoint events.


Proof First consider the infinite collection A₁ = ∅, A₂ = ∅, A₃ = ∅, .... Since ∅ ∩ ∅ = ∅, the
events in this collection are disjoint and ∪ Aᵢ = ∅. Axiom 3 then gives

$$P(\emptyset) = \sum_{i=1}^{\infty} P(\emptyset)$$

This can happen only if P(∅) = 0.

Now suppose that A₁, A₂, ..., Aₖ are disjoint events, and append to these the infinite collection
Aₖ₊₁ = ∅, Aₖ₊₂ = ∅, Aₖ₊₃ = ∅, .... Then the events A₁, A₂, ..., Aₖ, Aₖ₊₁, ... are disjoint, since
A ∩ ∅ = ∅ for all events A. Again invoking Axiom 3,

$$P\left(\bigcup_{i=1}^{k} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) + \sum_{i=k+1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) + \sum_{i=k+1}^{\infty} 0 = \sum_{i=1}^{k} P(A_i)$$

as desired. ■


Example 2.11 Consider evaluating a refurbished hard drive with a certifier. The certifier either 
deems the drive acceptable (the outcome A) or unacceptable (the outcome U). The sample space for
this experiment is therefore S = {A, U}. The axioms specify P(S) = 1, so the probability assignment will be
completed by determining P(A) and P(U). Since A and U are disjoint and their union is S, the
foregoing proposition implies that

1 = P(S) = P(A) + P(U)


It follows that P(U) = 1 − P(A). One possible assignment of probabilities is P(A) = .5, P(U) = .5,
whereas another possible assignment is P(A) = .75, P(U) = .25. In fact, letting p represent any fixed
number between 0 and 1, P(A) = p, P(U) = 1 − p is an assignment consistent with the axioms.


Example 2.12 (Example 2.4 continued) Consider testing cell phone batteries coming off an
assembly line one by one until a battery having a voltage within prescribed limits is found. The
simple events are E₁ = {S}, E₂ = {FS}, E₃ = {FFS}, E₄ = {FFFS}, .... Suppose the probability of
any particular battery being satisfactory is .99. Then it can be shown that the probability assignment
P(E₁) = .99, P(E₂) = (.01)(.99), P(E₃) = (.01)²(.99), ... satisfies the axioms. In particular, because the
Eᵢ’s are disjoint and the sample space is the union E₁ ∪ E₂ ∪ E₃ ∪ ..., Axioms 2 and 3 require that

$$1 = P(E_1 \cup E_2 \cup E_3 \cup \cdots) = P(E_1) + P(E_2) + P(E_3) + \cdots = .99[1 + .01 + (.01)^2 + (.01)^3 + \cdots]$$

This can be verified using the formula for the sum of a geometric series:

$$a + ar + ar^2 + ar^3 + \cdots = \frac{a}{1 - r}$$

However, another legitimate (according to the axioms) probability assignment of the same “geometric”
type is obtained by replacing .99 by any other number p between 0 and 1 (and .01 by 1 − p). ■
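
As a quick numerical check of Example 2.12, the following Python sketch builds the first few terms of the geometric-type assignment and confirms that their sum is essentially 1. The function name geometric_assignment and the cutoffs of 20 and 200 terms are illustrative assumptions, not part of the text.

```python
def geometric_assignment(p, n_terms):
    """Return P(E_1), ..., P(E_n) with P(E_i) = (1 - p)**(i - 1) * p.

    p is the probability that any particular battery is satisfactory;
    E_i is the event that the first satisfactory battery is the ith one.
    """
    return [(1 - p) ** (i - 1) * p for i in range(1, n_terms + 1)]


# The assignment from Example 2.12 (p = .99), truncated to 20 terms.
probs = geometric_assignment(0.99, 20)
partial_sum = sum(probs)
print(probs[:3])      # first three simple-event probabilities
print(partial_sum)    # very close to 1

# Any other p strictly between 0 and 1 also gives a valid assignment;
# more terms are needed before the partial sum gets close to 1.
other_sum = sum(geometric_assignment(0.3, 200))
print(other_sum)
```

Because only finitely many terms are summed, the result is a partial sum; the geometric series formula is what guarantees that the full infinite sum equals exactly 1.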


Interpreting Probability 

Examples 2.11 and 2.12 show that the axioms do not completely determine an assignment of 
probabilities to events. The axioms serve only to rule out assignments inconsistent with our intuitive 
notions of probability. In the certifier experiment of Example 2.11, two particular assignments were 
suggested. The appropriate or correct assignment depends on the nature of the refurbished hard drives 
and also on one’s interpretation of probability. The interpretation that is most frequently used and 
most easily understood is based on the notion of relative frequencies. 

Consider an experiment that can be repeatedly performed in an identical and independent fashion, 
and let A be an event consisting of a fixed set of outcomes of the experiment. Simple examples of 
such repeatable experiments include the tack-tossing and die-rolling experiments previously discussed.
If the experiment is performed n times, on some of the replications the event A will occur (the
outcome will be in the set A), and on others, A will not occur. Let n(A) denote the number of
replications on which A does occur. Then the ratio n(A)/n is called the relative frequency of occurrence
of the event A in the sequence of n replications.

For example, let A be the event that a flight arrives on time at a certain airport. The results of ten 
such flights (the first ten replications) might be as follows. 


Flight                    1    2    3     4     5    6     7     8     9     10
On time (did A occur)?    Y    Y    N     Y     N    Y     Y     Y     Y     N
Relative frequency of A   1    1    .667  .75   .6   .667  .714  .75   .778  .7

Figure 2.2 shows the relative frequency, n(A)/n, of on-time arrivals as n increases. We see that the
relative frequency fluctuates a lot for smaller values of n (this is also visible in the table); but, as
n increases, the relative frequency appears to stabilize.


[Figure: relative frequency of on-time arrival versus number of flights, for (a) the first 20 flights and (b) the first 200 flights, where it approaches .72]

Figure 2.2 (a) Initial fluctuation and (b) eventual stabilization of relative frequency

More generally, both empirical evidence and mathematical theory indicate that any relative frequency
of this sort will stabilize as the number of replications n increases. That is, as n gets arbitrarily
large, n(A)/n approaches a limiting value we refer to as the long-run (or limiting) relative frequency 
of the event A. The objective interpretation of probability identifies this limiting relative frequency 
with P(A); e.g., in Figure 2.2b, the limiting relative frequency is .72, and so we say the probability of 
event A is P(A) = .72. A formal justification of this interpretation is provided by the Law of Large 
Numbers, a theorem we’ll encounter in Chapter 6. 
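
The stabilization of relative frequency can be explored with a short simulation. The Python sketch below first recomputes the running relative frequencies for the ten flights listed earlier, and then simulates a long sequence of replications under the assumption that each flight is on time independently with probability .72. The function name running_relative_frequency, the seed, and the number of simulated flights are illustrative choices, not part of the text.

```python
import random

# Running relative frequency n(A)/n for the ten observed flights.
flights = "YYNYNYYYYN"
table = [round(flights[:k].count("Y") / k, 3) for k in range(1, 11)]
print(table)  # same values as the relative frequency row of the flight table


def running_relative_frequency(p, n, seed=0):
    """Simulate n independent replications in which A occurs with
    probability p; return the list of n(A)/n after each replication."""
    rng = random.Random(seed)
    count = 0
    freqs = []
    for i in range(1, n + 1):
        if rng.random() < p:   # event A occurs on this replication
            count += 1
        freqs.append(count / i)
    return freqs


# Assumed long-run on-time probability of .72, as in Figure 2.2(b).
freqs = running_relative_frequency(p=0.72, n=100_000, seed=1)
for n in (10, 100, 1_000, 100_000):
    print(n, round(freqs[n - 1], 3))
```

The early values typically wander, while the later ones settle near .72; the exact printed numbers depend on the seed, so none are quoted here.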

Suppose that probabilities are assigned to events in accordance with their limiting relative fre- 
quencies. Then a statement such as “the probability of a flight arriving on time is .72” means that of a 
large number of flights, roughly 72% will arrive on time. Similarly, if B is the event that a certain 
brand of dishwasher will need service while under warranty, then “P(B) = .1” is interpreted to mean 
that in the long run 10% of all such dishwashers will need warranty service. This does not mean that 
exactly 1 out of every 10 will need service, or exactly 20 out of 200 will need service, because 10 and 
200 are not the long run. Such misinterpretations of probability as a guarantee on short-term outcomes 
are at the heart of the infamous gambler’s fallacy. 

This relative frequency interpretation of probability is said to be objective because it rests on a 
property of the experiment rather than on any particular individual concerned with the experiment. 
For example, two different observers of a sequence of coin tosses should both use the same proba- 
bility assignments, since the observers have nothing to do with limiting relative frequency. 

In practice, this interpretation is not as objective as it might seem, because the limiting relative 
frequency of an event will not be known. Thus we will have to assign probabilities based on our 
beliefs about the limiting relative frequency of events under study. Fortunately, there are many 
experiments for which there will be a consensus with respect to probability assignments. When we 
speak of a fair coin, we shall mean P(H) = P(T) = .5, and a fair die is one for which limiting
relative frequencies of the six outcomes are all equal, suggesting probability assignments
P(1) = ··· = P(6) = 1/6.

Because the objective interpretation of probability is based on the notion of limiting frequency, its 
applicability is limited to experimental situations that are repeatable. Yet the language of probability 
is often used in connection with situations that are inherently unrepeatable. Examples include: “The 
chances are good for a peace agreement”; “It is likely that our company will be awarded the contract”;
and “Because their best quarterback is injured, I expect them to score no more than 10 points against 
us.” In such situations we would like, as before, to assign numerical probabilities to various outcomes 
and events (e.g., the probability is .9 that we will get the contract). We must therefore adopt an 
alternative interpretation of these probabilities. Because different observers may have different prior 
information and opinions concerning such experimental situations, probability assignments may now 
differ from individual to individual. Interpretations in such situations are thus referred to as subjective. 
The book by Winkler listed in the references gives a very readable survey of several subjective 
interpretations. Importantly, even subjective interpretations of probability must satisfy the three 
axioms (and all properties that follow from the axioms) in order to be valid. 


More Probability Properties 


COMPLEMENT RULE For any event A, P(A) = 1 − P(A′).

Proof Since by definition of A′, A ∪ A′ = S while A and A′ are disjoint, 1 = P(S) = P(A ∪ A′) = P(A) + P(A′),
from which the desired result follows. ■


This proposition is surprisingly useful because there are many situations in which P(A’) is more easily 
obtained by direct methods than is P(A). 


Example 2.13 Consider a system of five identical components connected in series, as illustrated in 
Figure 2.3. 


Figure 2.3 A system of five components connected in series 


Denote a component that fails by F and one that doesn’t fail by S (for success). Let A be the event 
that the system fails. For A to occur, at least one of the individual components must fail. Outcomes in 
A include SSFSS (1, 2, 4, and 5 all work, but 3 does not), FFSSS, and so on. There are, in fact, 31 
different outcomes in A! However, A’, the event that the system works, consists of the single outcome 
SSSSS. We will see in Section 2.5 that if 90% of all these components do not fail and different
components fail independently of one another, then P(A′) = (.9)⁵ = .59. Thus P(A) = 1 − .59 = .41; so
among a large number of such systems, roughly 41% will fail. ■
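
The counting and arithmetic in Example 2.13 can be reproduced with a few lines of Python. This is only a sketch: it enumerates the 32 outcomes, confirms that 31 of them lie in A, and then applies the Complement Rule using the value (.9)⁵ quoted in the example, which assumes independent components. The variable names are illustrative.

```python
from itertools import product

# All 2**5 outcomes for five components, each S (works) or F (fails).
outcomes = ["".join(o) for o in product("SF", repeat=5)]

# A = system fails = at least one component fails (series connection).
A = [o for o in outcomes if "F" in o]
A_complement = [o for o in outcomes if "F" not in o]

print(len(outcomes), len(A), A_complement)

# Complement Rule: P(A) = 1 - P(A'), with P(A') = .9**5 taken from the
# example (this relies on the independence assumption of Section 2.5).
p_works = 0.9 ** 5
p_fails = 1 - p_works
print(round(p_works, 2), round(p_fails, 2))
```

Working with the single outcome in A′ avoids summing the probabilities of the 31 outcomes in A one at a time.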


In general, the Complement Rule is useful when the event of interest can be expressed as “at least
...,” because the complement “less than ...” may be easier to work with. (In some problems, “more
than ...” is easier to deal with than “at most ....”) When you are having difficulty calculating
P(A) directly, think of first determining P(A’). 


PROPOSITION For any event A, P(A) ≤ 1.


This follows from the previous proposition: 1 = P(A) + P(A′) ≥ P(A), because P(A′) ≥ 0 by
Axiom 1.

When A and B are disjoint, we know that P(A ∪ B) = P(A) + P(B). How can this union probability
be obtained when the events are not disjoint?


ADDITION RULE For any events A and B,


P(A ∪ B) = P(A) + P(B) − P(A ∩ B).


Notice that the proposition is valid even if A and B are disjoint, since then P(A ∩ B) = 0. The key idea
is that, in adding P(A) and P(B), the probability of the intersection A ∩ B is actually counted twice, so
P(A ∩ B) must be subtracted out.


Proof Note first that A ∪ B = A ∪ (B ∩ A′), as shown in Figure 2.4. Because A and (B ∩ A′) are
disjoint, P(A ∪ B) = P(A) + P(B ∩ A′). But B = (B ∩ A) ∪ (B ∩ A′) (the union of the part of B in
A and the part of B not in A). Furthermore, (B ∩ A) and (B ∩ A′) are disjoint, so
P(B) = P(B ∩ A) + P(B ∩ A′). Combining these results gives

P(A ∪ B) = P(A) + P(B ∩ A′) = P(A) + [P(B) − P(A ∩ B)] = P(A) + P(B) − P(A ∩ B) ■


Figure 2.4 Representing A ∪ B as a union of disjoint events


Example 2.14 In a certain residential suburb, 60% of all households get internet service from the 
local cable company, 80% get television service from that company, and 50% get both services from 
the company. If a household is randomly selected, what is the probability that it gets at least one of 
these two services from the company, and what is the probability that it gets exactly one of the 
services from the company? 

With A = {gets internet service from the cable company} and B = {gets television service from 
the cable company}, the given information implies that P(A) = .6, P(B) = .8, and P(A ∩ B) = .5. The
Addition Rule then applies to give


P(gets at least one of these two services from the company)
= P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = .6 + .8 − .5 = .9


The event that a household gets only television service from the company can be written as A′ ∩ B, i.e.,
(not internet) and television. Now Figure 2.4 implies that


.9 = P(A ∪ B) = P(A) + P(A′ ∩ B) = .6 + P(A′ ∩ B)


from which P(A′ ∩ B) = .3. Similarly, P(A ∩ B′) = P(A ∪ B) − P(B) = .1. This is all illustrated in
Figure 2.5, from which we see that


P(exactly one) = P(A ∩ B′) + P(A′ ∩ B) = .1 + .3 = .4


Figure 2.5 Probabilities for Example 2.14 (Venn diagram with regions labeled P(A ∩ B′) and P(A′ ∩ B)) ■
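
The arithmetic of Example 2.14 can be sketched in Python as follows. Exact fractions are used so that the results are not blurred by floating-point rounding; the function name union_probability and the variable names are illustrative assumptions.

```python
from fractions import Fraction


def union_probability(p_a, p_b, p_both):
    """Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both


# Values from Example 2.14: internet (A), television (B), both.
p_a = Fraction(6, 10)
p_b = Fraction(8, 10)
p_both = Fraction(5, 10)

p_union = union_probability(p_a, p_b, p_both)   # at least one service
p_only_b = p_union - p_a                        # television but not internet
p_only_a = p_union - p_b                        # internet but not television
p_exactly_one = p_only_a + p_only_b

print(float(p_union), float(p_only_b), float(p_only_a), float(p_exactly_one))
```

The two "only" probabilities use the disjoint decomposition from Figure 2.4: the union is A together with the part of B outside A, and symmetrically with the roles of A and B exchanged.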


The probability of a union of more than two events can be computed analogously. For three events A,
B, and C, the result is


P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)


This can be seen by examining a Venn diagram of A ∪ B ∪ C, which is shown in Figure 2.6.

Figure 2.6 A ∪ B ∪ C


When P(A), P(B), and P(C) are added, outcomes in certain intersections are double counted and
the corresponding probabilities must be subtracted. But this results in P(A ∩ B ∩ C) being subtracted
once too often, so it must be added back. One formal proof involves applying the Addition Rule to
P((A ∪ B) ∪ C), the probability of the union of the two events A ∪ B and C; see Exercise 30. More
generally, a result concerning P(A₁ ∪ ··· ∪ Aₖ) can be proved by induction or by other methods. The
pattern of additions and subtractions (or, equivalently, the method of deriving such union probability
formulas) is often called the inclusion–exclusion principle.
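
The three-event formula can be checked by brute force on a small sample space. The Python sketch below uses the 36 equally likely outcomes of two fair dice and three events chosen purely for illustration (they do not come from the text); it compares the probability of the union, computed directly, with the inclusion–exclusion expression.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two fair dice, all equally likely.
S = set(product(range(1, 7), repeat=2))


def P(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event), len(S))


# Three hypothetical, overlapping events.
A = {s for s in S if s[0] % 2 == 0}       # first die is even
B = {s for s in S if s[0] + s[1] >= 8}    # sum is at least 8
C = {s for s in S if s[1] <= 2}           # second die shows 1 or 2

lhs = P(A | B | C)                        # union computed directly
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))                    # inclusion-exclusion

print(lhs, rhs, lhs == rhs)
```

Agreement on one example is of course not a proof, but changing the definitions of A, B, and C is an easy way to see that the pattern of additions and subtractions always balances.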


Determining Probabilities Systematically 

When the number of possible outcomes (simple events) is large, there will be many compound events.
A simple way to determine probabilities for these events that avoids violating the axioms and derived
properties is to first determine probabilities P(Eᵢ) for all simple events. These should satisfy
P(Eᵢ) ≥ 0 and $\sum_{\text{all } i} P(E_i) = 1$. Then the probability of any compound event A is computed by adding
together the P(Eᵢ)’s for all Eᵢ’s in A:

$$P(A) = \sum_{E_i \text{ in } A} P(E_i)$$


Example 2.15 During off-peak hours a commuter train has five cars. Suppose a commuter is twice 
as likely to select the middle car (#3) as to select either adjacent car (#2 or #4), and is twice as likely 
to select either adjacent car as to select either end car (#1 or #5). Let pᵢ = P(car i is selected) = P(Eᵢ).
Then we have p₃ = 2p₂ = 2p₄ and p₂ = 2p₁ = 2p₅ = p₄. This gives

$$1 = \sum P(E_i) = p_1 + 2p_1 + 4p_1 + 2p_1 + p_1 = 10p_1$$

implying p₁ = p₅ = .1, p₂ = p₄ = .2, and p₃ = .4. The probability that one of the three middle cars is
selected (a compound event) is then p₂ + p₃ + p₄ = .8. ■
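
The approach of Example 2.15 (fix relative weights, normalize so the simple-event probabilities sum to 1, then add over the outcomes in an event) can be sketched in Python. The dictionary of weights encodes the "twice as likely" statements from the example; the names weights, p, and prob are illustrative assumptions.

```python
from fractions import Fraction

# Relative weights for cars 1..5: the middle car is twice as likely as an
# adjacent car, which is twice as likely as an end car.
weights = {1: 1, 2: 2, 3: 4, 4: 2, 5: 1}

# Normalize so that the simple-event probabilities sum to 1.
total = sum(weights.values())
p = {car: Fraction(w, total) for car, w in weights.items()}

# Sanity checks corresponding to the axioms.
assert all(prob_i >= 0 for prob_i in p.values())
assert sum(p.values()) == 1


def prob(event):
    """P(A) as the sum of P(E_i) over the simple events E_i in A."""
    return sum(p[car] for car in event)


print({car: float(q) for car, q in p.items()})
print(float(prob({2, 3, 4})))   # one of the three middle cars
```

The same pattern works for any finite sample space: once the simple-event probabilities are valid, every compound-event probability obtained by summation automatically respects the axioms.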


Equally Likely Outcomes 

In many experiments consisting of N outcomes, it is reasonable to assign equal probabilities to all
N simple events. These include such obvious examples as tossing a fair coin or fair die once (or any fixed
number of times), or selecting one or several cards from a well-shuffled deck of 52. With p = P(Eᵢ)
for every i,

$$1 = \sum_{i=1}^{N} P(E_i) = \sum_{i=1}^{N} p = p \cdot N, \quad \text{so} \quad p = \frac{1}{N}$$

That is, if there are N possible outcomes, then the probability assigned to each is 1/N.

Now consider an event A, with N(A) denoting the number of outcomes contained in A. Then

$$P(A) = \sum_{E_i \text{ in } A} P(E_i) = \sum_{E_i \text{ in } A} \frac{1}{N} = \frac{N(A)}{N}$$

Once we have counted the number N of outcomes in the sample space, to compute the probability of 
any event we must count the number of outcomes contained in that event and take the ratio of the two 
numbers. Thus when outcomes are equally likely, computing probabilities reduces to counting. 


Example 2.16 When two dice are rolled separately, there are N = 36 outcomes (delete the first row 
and column from the table in Example 2.3). If both the dice are fair, all 36 outcomes are equally 
likely, so P(Eᵢ) = 1/36 for each simple event. The event A = {sum of two numbers is 8} consists of
the five outcomes (2, 6), (3, 5), (4, 4), (5, 3), and (6, 2), so

$$P(A) = \frac{N(A)}{N} = \frac{5}{36}$$


The next section of this book develops some useful counting methods. 
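
For a sample space as small as the one in Example 2.16, the counting can simply be done by enumeration. The following Python sketch lists the 36 outcomes for two fair dice, collects those whose sum is 8, and forms the ratio N(A)/N; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# N = 36 equally likely outcomes (first die, second die).
S = list(product(range(1, 7), repeat=2))

# A = {sum of the two numbers is 8}.
A = [outcome for outcome in S if sum(outcome) == 8]

# With equally likely outcomes, P(A) = N(A) / N.
p_A = Fraction(len(A), len(S))

print(A)
print(p_A)
```

Enumeration stops being practical once N is large, which is where general counting rules become useful.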


Exercises: Section 2.2 (13–30)


13. A mutual fund company offers its customers
several different funds: a money
market fund, three different bond funds
(short, intermediate, and long term), two
stock funds (moderate and high risk), and a
balanced fund. Among customers who own
shares in just one fund, the percentages of
customers in the different funds are as
follows:

Money market        20%           High-risk stock       18%
Short bond          [illegible]   Moderate-risk stock   [illegible]
Intermediate bond   10%           Balanced              7%
Long bond           5%

A customer who owns shares in just one
fund is randomly selected.

a. What is the probability that the selected
individual owns shares in the balanced
fund?

b. What is the probability that the individual
owns shares in a bond fund?

c. What is the probability that the selected
individual does not own shares in a
stock fund?


14. Consider randomly selecting a student at a
certain university, and let A denote the event
that the selected individual has a Visa credit
card and B be the analogous event for a
MasterCard. Suppose that P(A) = .5, P(B)
= .4, and P(A ∩ B) = .25.

a. Compute the probability that the selected
individual has at least one of the two
types of cards (i.e., the probability of the
event A ∪ B).

b. What is the probability that the selected
individual has neither type of card?

c. Describe, in terms of A and B, the event
that the selected student has a Visa card
but not a MasterCard, and then calculate
the probability of this event.


15. A consulting firm presently has bids out on
three projects. Let Aᵢ = {awarded project i},
for i = 1, 2, 3, and suppose that P(A₁) = .22,
P(A₂) = .25, P(A₃) = .28, P(A₁ ∩ A₂) = .11,
P(A₁ ∩ A₃) = .05, P(A₂ ∩ A₃) = .07, and
P(A₁ ∩ A₂ ∩ A₃) = .01. Express in words
each of the following events, and compute
the probability of each event:

a. A₁ ∪ A₂

b. A₁′ ∩ A₂′ [Hint: (A₁ ∪ A₂)′ = A₁′ ∩ A₂′]

c. A₁ ∪ A₂ ∪ A₃

d. A₁′ ∩ A₂′ ∩ A₃′

e. A₁′ ∩ A₂′ ∩ A₃

f. (A₁′ ∩ A₂′) ∪ A₃

16. A particular state will elect both a governor
and a senator. Let A be the event that a
randomly selected voter has a favorable
view of a certain party’s senatorial candidate,
and let B be the corresponding event
for that party’s gubernatorial candidate.
Suppose that P(A′) = .44, P(B′) = .57,
and P(A ∪ B) = .68.

a. What is the probability that a randomly
selected voter has a favorable view of
both candidates?

b. What is the probability that a randomly
selected voter has an unfavorable view
of at least one of these candidates?

c. What is the probability that a randomly
selected voter has a favorable view of
exactly one of these candidates?


17. Consider the type of clothes dryer (gas or
electric) purchased by each of five different
customers at a certain store.

a. If the probability that at most one of
these customers purchases an electric
dryer is .428, what is the probability that
at least two purchase an electric dryer?

b. If P(all five purchase gas) = .116 and
P(all five purchase electric) = .005,
what is the probability that at least one
of each type is purchased?


18. An individual is presented with three different
glasses of cola, labeled C, D, and
P. He is asked to taste all three and then list
them in order of preference. Suppose the
same cola has actually been put into all
three glasses.

a. What are the simple events in this
ranking experiment, and what probability
would you assign to each one?

b. What is the probability that C is ranked
first?

c. What is the probability that C is ranked
first and D is ranked last?


19. Let A denote the event that the next request
for assistance from a statistical software
consultant relates to the SPSS package, and
let B be the event that the next request is for
help with SAS. Suppose that P(A) = .30
and P(B) = .50.

a. Why is it not the case that P(A) +
P(B) = 1?

b. Calculate P(A′).

c. Calculate P(A ∪ B).

d. Calculate P(A′ ∩ B′).


20. A box contains four 40-W bulbs, five 60-W
bulbs, and six 75-W bulbs. If bulbs are
selected one by one in random order, what is
the probability that at least two bulbs must
be selected to obtain one that is rated 75 W?


21. Human visual inspection of solder joints on
printed circuit boards can be very subjective.
Part of the problem stems from the
numerous types of solder defects (e.g., pad
nonwetting, knee visibility, voids) and even
the degree to which a joint possesses one or
more of these defects. Consequently, even
highly trained inspectors can disagree on
the disposition of a particular joint. In one
batch of 10,000 joints, inspector A found
724 that were judged defective, inspector B
found 751 such joints, and 1159 of the
joints were judged defective by at least one
of the inspectors. Suppose that one of the
10,000 joints is randomly selected.

a. What is the probability that the selected
joint was judged to be defective by neither
of the two inspectors?

b. What is the probability that the selected
joint was judged to be defective by
inspector B but not by inspector A?


22. A factory operates three different shifts.
Over the last year, 200 accidents have
occurred at the factory. Some of these can
be attributed at least in part to unsafe
working conditions, whereas the others are
unrelated to working conditions. The
accompanying table gives the percentage of
accidents falling in each type of accident–shift
category.

Shift    Unsafe conditions    Unrelated to conditions
Day      10%                  35%
Swing    8%                   20%
Night    5%                   22%

Suppose one of the 200 accident reports is
randomly selected from a file of reports,
and the shift and type of accident are
determined.

a. What are the simple events?

b. What is the probability that the selected
accident was attributed to unsafe
conditions?

c. What is the probability that the selected
accident did not occur on the day shift?

23. An insurance company offers four different
deductible levels (none, low, medium, and
high) for its homeowner’s policyholders and
three different levels (low, medium, and
high) for its automobile policyholders. The
accompanying table gives proportions for the
various categories of policyholders who have
both types of insurance. For example, the
proportion of individuals with both low
homeowner’s deductible and low auto
deductible is .06 (6% of all such individuals).

             Homeowner’s
Auto     N      L      M      H
L        .04    .06    .05    .03
M        .07    .10    .20    .10
H        .02    .03    .15    .15

Suppose an individual having both types of
policies is randomly selected.

a. What is the probability that the
individual has a medium auto deductible
and a high homeowner’s
deductible?

b. What is the probability that the individual
has a low auto deductible? A low
homeowner’s deductible?

c. What is the probability that the
individual is in the same category
for both auto and homeowner’s
deductibles?

d. Based on your answer in part (c), what
is the probability that the two categories
are different?

e. What is the probability that the individual
has at least one low deductible
level?

f. Using the answer in part (e), what is the
probability that neither deductible level
is low?

24. The route used by a driver in commuting
to work contains two intersections with
traffic signals. The probability that he
must stop at the first signal is .4, the
analogous probability for the second signal
is .5, and the probability that he must
stop at one or more of the two signals is
.6. What is the probability that he must
stop

a. At both signals?

b. At the first signal but not at the second
one?

c. At exactly one signal?


25. The computers of six faculty members in a
certain department are to be replaced. Two
of the faculty members have selected laptop
machines, and the other four have
chosen desktop machines. Suppose that
only two of the setups can be done on a
particular day, and the two computers to
be set up are randomly selected from the
six (implying 15 equally likely outcomes;
if the computers are numbered 1, 2, ..., 6,
then one outcome consists of computers 1
and 2, another consists of computers 1 and
3, and so on).

a. What is the probability that both selected
setups are for laptop computers?

b. What is the probability that both selected
setups are desktop machines?

c. What is the probability that at least one
selected setup is for a desktop
computer?

d. What is the probability that at least one
computer of each type is chosen for
setup?

26. Use the axioms to show that if one event
A is contained in another event B (i.e., A is
a subset of B), then P(A) ≤ P(B). [Hint:
For such A and B, A and B ∩ A′ are disjoint
and B = A ∪ (B ∩ A′), as can be seen from a
Venn diagram.] For general A and B, what
does this imply about the relationship
among P(A ∩ B), P(A), and P(A ∪ B)?


27. The three major options on a car model
are an automatic transmission (A), a sunroof
(B), and an upgraded stereo (C). If
70% of all purchasers request A, 80%
request B, 75% request C, 85% request
A or B, 90% request A or C, 95% request
B or C, and 98% request A or B or C,
compute the probabilities of the following
events. [Hint: “A or B” is the event that at
least one of the two options is requested;
try drawing a Venn diagram and labeling
all regions.]

a. The next purchaser will request at least
one of the three options.

b. The next purchaser will select none of
the three options.

c. The next purchaser will request only an
automatic transmission and neither of
the other two options.

d. The next purchaser will select exactly
one of these three options.

28. A certain system can experience three different
types of defects. Let Aᵢ (i = 1, 2, 3)
denote the event that the system has a
defect of type i. Suppose that

P(A₁) = .12   P(A₂) = .07   P(A₃) = .05
P(A₂ ∪ A₃) = .10   P(A₁ ∩ A₂ ∩ A₃) = .01

a. What is the probability that the system
does not have a type 1 defect?

b. What is the probability that the system
has both type 1 and type 2 defects?

c. What is the probability that the system
has both type 1 and type 2 defects but
not a type 3 defect?

d. What is the probability that the system
has at most two of these defects?


29. In Exercise 7, suppose that any incoming
individual is equally likely to be assigned to
any of the three stations irrespective of
where other individuals have been
assigned. What is the probability that

a. All three family members are assigned
to the same station?

b. At most two family members are
assigned to the same station?

c. Every family member is assigned to a
different station?


30. Apply the Addition Rule to the union of the
two events (A ∪ B) and C in order to verify
the formula for P(A ∪ B ∪ C).
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When the various outcomes of an experiment are equally likely (the same probability is assigned to each 
simple event), the task of computing probabilities reduces to counting. In particular, if Nis the number of 
outcomes in a sample space and NA) is the number of outcomes contained in an event A, then 


re en ane (2.1) 


If a list of the outcomes is available or easy to construct and N is small, then the numerator and 
denominator of Equation (2.1) can be obtained without the benefit of any general counting principles. 

There are, however, many experiments for which the effort involved in constructing such a list is 
prohibitive because N is quite large. By exploiting some general counting rules, it is possible to 
compute probabilities of the form (2.1) without a listing of outcomes. These rules are also useful in 
many problems involving outcomes that are not equally likely. Several of the rules developed here 
will be used in studying probability distributions in the next chapter. 


The Fundamental Counting Principle 

Our first counting rule applies to any situation in which an event consists of ordered pairs of objects and
we wish to count the number of such pairs. By an ordered pair, we mean that, if $O_1$ and $O_2$ are objects,
then the pair $(O_1, O_2)$ is different from the pair $(O_2, O_1)$. For example, if an individual selects one airline
for a trip from Los Angeles to Chicago and a second one for continuing on to New York, one possibility is
(American, United), another is (United, American), and still another is (United, United).


PROPOSITION If the first element or object of an ordered pair can be selected in $n_1$ ways, and
for each of these $n_1$ ways the second element of the pair can be selected in $n_2$
ways, then the number of pairs is $n_1 n_2$.


Example 2.17 A homeowner doing some remodeling requires the services of both a plumbing
contractor and an electrical contractor. If there are 12 plumbing contractors and 9 electrical
contractors available in the area, in how many ways can the contractors be chosen? If we denote the
plumbers by $P_1, \ldots, P_{12}$ and the electricians by $Q_1, \ldots, Q_9$, then we wish the number of pairs of the
form $(P_i, Q_j)$. With $n_1 = 12$ and $n_2 = 9$, the proposition yields N = (12)(9) = 108 possible ways of
choosing the two types of contractors. ■
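
As a quick check on Example 2.17, the following minimal Python sketch lists the ordered pairs explicitly and compares the count with the product of the two numbers of choices. The labels P1, ..., Q9 simply mirror the example's notation; the variable names are illustrative and not part of the text.

```python
from itertools import product

# Labels mirror the notation of Example 2.17 (12 plumbers, 9 electricians).
plumbers = [f"P{i}" for i in range(1, 13)]
electricians = [f"Q{j}" for j in range(1, 10)]

# Every plumber can be paired with every electrician.
pairs = list(product(plumbers, electricians))

n_pairs = len(pairs)
print(n_pairs)  # 108
```

Enumeration is feasible here only because N is small; the proposition gives the same count without building the list.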


In Example 2.17, the choice of the second element of the pair did not depend on which first 
element was chosen or occurred. As long as there is the same number of choices of the second 
element for each first element, the foregoing proposition is valid even when the set of possible second 
elements depends on the first element. 


Example 2.18 A family has just moved to a new city and requires the services of both an
obstetrician and a pediatrician. There are two easily accessible medical clinics, each having two
obstetricians and three pediatricians. The family will obtain maximum health insurance benefits by joining
a clinic and selecting both doctors from that clinic. In how many ways can this be done? Denote the
obstetricians by $O_1$, $O_2$, $O_3$, and $O_4$ and the pediatricians by $P_1, \ldots, P_6$. Then we wish the number of
pairs $(O_i, P_j)$ for which $O_i$ and $P_j$ are associated with the same clinic. Because there are four
obstetricians, $n_1 = 4$, and for each there are three choices of pediatrician, so $n_2 = 3$. Applying the
proposition gives $N = n_1 n_2 = 12$ possible choices. ■


If a six-sided die is tossed five times in succession, then each possible outcome is an ordered
collection of five numbers such as (1, 3, 1, 2, 4) or (6, 5, 2, 2, 2). We will call an ordered collection of 
k objects a k-tuple (so a pair is a 2-tuple and a triple is a 3-tuple). Each outcome of the die-tossing 
experiment is then a 5-tuple. The following theorem, called the Fundamental Counting Principle, 
generalizes the previous proposition to k-tuples. 


FUNDAMENTAL COUNTING PRINCIPLE Suppose a set consists of ordered collections of k elements
(k-tuples) and that there are $n_1$ possible choices for the first element; for each choice of the first
element, there are $n_2$ possible choices of the second element; ...; for each
possible choice of the first $k - 1$ elements, there are $n_k$ choices of the kth
element. Then there are $n_1 n_2 \cdots n_k$ possible k-tuples.


Example 2.19 (Example 2.17 continued) Suppose the home remodeling job involves first
purchasing several kitchen appliances. They will all be purchased from the same dealer, and there are five
dealers in the area. With the dealers denoted by $D_1, \ldots, D_5$, there are $N = n_1 n_2 n_3 = (5)(12)(9) = 540$
3-tuples of the form $(D_i, P_j, Q_k)$, so there are 540 ways to choose first an appliance dealer, then a
plumbing contractor, and finally an electrical contractor. ■


Example 2.20 (Example 2.18 continued) If each clinic has both three specialists in internal
medicine and two general surgeons, there are $n_1 n_2 n_3 n_4 = (4)(3)(3)(2) = 72$ ways to select one doctor
of each type such that all doctors practice at the same clinic. ■
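
The counts in Examples 2.19 and 2.20 can be checked with a short Python sketch. The first count is a plain product; the second enumerates the 4-tuples clinic by clinic, which respects the "same clinic" restriction. The clinic names and doctor labels are hypothetical; only the numbers of doctors come from the examples.

```python
from itertools import product
from math import prod

# Example 2.19: 5 dealers, 12 plumbers, 9 electricians.
n_remodel = prod([5, 12, 9])

# Example 2.20: each clinic has 2 obstetricians, 3 pediatricians,
# 3 internists, and 2 surgeons (labels below are hypothetical).
staff_per_clinic = {"obstetrician": 2, "pediatrician": 3, "internist": 3, "surgeon": 2}

selections = []
for clinic in ("clinic 1", "clinic 2"):
    rosters = [
        [f"{clinic} {role} {i}" for i in range(1, count + 1)]
        for role, count in staff_per_clinic.items()
    ]
    # All doctors in a selection come from the same clinic.
    selections.extend(product(*rosters))

n_doctors = len(selections)
print(n_remodel, n_doctors)  # 540 72
```

Each clinic contributes (2)(3)(3)(2) = 36 selections, and the two clinics together give 72, in agreement with (4)(3)(3)(2).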


Tree Diagrams 

In many counting and probability problems, a tree diagram can be used to represent pictorially all 
the possibilities. The tree diagram associated with Example 2.18 appears in Figure 2.7. Starting from 
a point on the left side of the diagram, for each possible first element of a pair a straight-line segment 
emanates rightward. Each of these lines is referred to as a first-generation branch. Now for any given 
first-generation branch we construct another line segment emanating from the tip of the branch for 
each possible choice of a second element of the pair. Each such line segment is a second-generation 
branch. Because there are four obstetricians, there are four first-generation branches, and three 
pediatricians for each obstetrician yield three second-generation branches emanating from each first- 
generation branch. 


Figure 2.7 Tree diagram for Example 2.18


Permutations

So far the successive elements of a k-tuple were selected from entirely different sets (e.g., appliance 
dealers, then plumbers, and finally electricians). In several tosses of a die, the set from which 
successive elements are chosen is always {1, 2, 3, 4, 5, 6}, but the choices are made “with 
replacement” so that the same element can appear more than once. If the die is rolled once, there are 
obviously 6 possible outcomes; for two rolls, there are $6^2 = 36$ possibilities, since we distinguish
(3, 5) from (5, 3). In general, if k selections are made with replacement from a set of n distinct objects
(such as the six sides of a die), then the total number of possible outcomes is $n^k$.

We now consider a fixed set consisting of n distinct elements and suppose that a k-tuple is formed 
by selecting successively from this set without replacement so that an element can appear in at most 
one of the k positions. 


DEFINITION Any ordered sequence of k objects taken without replacement from a set of n
distinct objects is called a permutation of size k of the objects. The number of
permutations of size k that can be constructed from the n objects is denoted by ${}_nP_k$.


The number of permutations of size k is obtained immediately from the Fundamental Counting 
Principle. The first element can be chosen in n ways; for each of these n ways the second element can
be chosen in $n - 1$ ways; and so on. Finally, for each way of choosing the first $k - 1$ elements, the kth
element can be chosen in $n - (k - 1) = n - k + 1$ ways, so


$${}_nP_k = n(n-1)(n-2)\cdots(n-k+2)(n-k+1)$$


Example 2.21 Ten teaching assistants are available for grading papers in a particular course. The 
first exam consists of four questions, and the professor wishes to select a different assistant to grade 
each question (only one assistant per question). In how many ways can assistants be chosen to grade 
the exam? Here n = the number of assistants = 10 and k = the number of questions = 4. The number 
of different grading assignments is then ${}_{10}P_4 = (10)(9)(8)(7) = 5040$. ■
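
The count in Example 2.21 can be confirmed two ways in Python: with the standard-library function `math.perm`, and by brute-force enumeration of the ordered assignments. This is only an illustrative sketch; the variable names are not from the text.

```python
from itertools import permutations
from math import perm

n_assistants, n_questions = 10, 4

# Formula: n(n-1)...(n-k+1)
by_formula = perm(n_assistants, n_questions)

# Brute force: every ordered choice of 4 distinct assistants out of 10.
by_enumeration = sum(1 for _ in permutations(range(n_assistants), n_questions))

# If the same assistant could grade several questions (selection with
# replacement), there would be n**k ordered outcomes instead.
with_replacement = n_assistants ** n_questions

print(by_formula, by_enumeration, with_replacement)  # 5040 5040 10000
```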


Example 2.22 The Birthday Problem. Disregarding the possibility of a February 29 birthday, 
suppose a randomly selected individual is equally likely to have been born on any one of the other 
365 days. If ten people are randomly selected, what is the probability that all have different birthdays? 

Imagine we draw ten days, with replacement, from the calendar to represent the birthdays of the 
ten randomly selected people. One possible outcome of this selection would be (March 31, December 
30, ..., September 27, February 12). There are $365^{10}$ such outcomes. The number of outcomes among
them with no repeated birthdays is


$$(365)(364)\cdots(356) = {}_{365}P_{10}$$


(any of the 365 calendar days may be selected first; if March 31 is chosen, any of the other 364 days
is acceptable for the second selection; and so on). Hence, the probability all ten randomly selected
people have different birthdays equals ${}_{365}P_{10}/365^{10} = .883$. Equivalently, there’s only a .117 chance
that at least two people out of these ten will share a birthday. It’s worth noting that the first probability
can be rewritten as

$$\frac{{}_{365}P_{10}}{365^{10}} = \frac{365}{365}\cdot\frac{364}{365}\cdots\frac{356}{365}$$


We may think of each fraction as representing the chance the next birthday selected will be different 
from all previous ones. (This is an example of conditional probability, the topic of the next section.) 

Now replace 10 with k (i.e., k randomly selected birthdays); what is the smallest k for which there 
is at least a 50-50 chance that two or more people will have the same birthday? Most people 
incorrectly guess that we need a very large group of people for this to be true; the most common guess 
is that 183 people are required (half the days on the calendar). But the required value of k is actually 
much smaller: the probability that k randomly selected people all have different birthdays equals 
${}_{365}P_k/365^k$, which not surprisingly decreases as k increases. Figure 2.8 displays this probability for
increasing values of k. As it turns out, the smallest k for which this probability falls below .5 is just 
k = 23. That is, there is less than a 50-50 chance (.4927, to be precise) of 23 randomly selected 
people all having different birthdays, and thus a probability .5073 that at least two people in a random 
sample of 23 will share a birthday. 


Figure 2.8 P(no birthday match) in Example 2.22 (vertical axis: P(no shared birthdays among k people)) ■
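
The probabilities quoted in Example 2.22 are easy to reproduce numerically. The sketch below assumes, as the example does, 365 equally likely birthdays; the function name `prob_all_different` is an illustrative choice.

```python
from math import perm

def prob_all_different(k, days=365):
    """P(k randomly selected people all have different birthdays)."""
    return perm(days, k) / days**k

p10 = prob_all_different(10)

# First k for which P(all different) drops below .5.
smallest_k = next(k for k in range(1, 366) if prob_all_different(k) < 0.5)

print(round(p10, 3))                             # 0.883
print(smallest_k)                                # 23
print(round(prob_all_different(smallest_k), 4))  # 0.4927
```

Evaluating `prob_all_different(k)` for k = 1, 2, ... traces out the decreasing curve described for Figure 2.8.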


The expression for ${}_nP_k$ can be rewritten with the aid of factorial notation. Recall that 7! (read “7
factorial”) is compact notation for the descending product of integers (7)(6)(5)(4)(3)(2)(1). More
generally, for any positive integer m, $m! = m(m-1)(m-2)\cdots(2)(1)$. This gives 1! = 1, and we
also define 0! = 1.

Using factorial notation, (10)(9)(8)(7) = (10)(9)(8)(7)(6!)/6! = 10!/6!. More generally,


$${}_nP_k = n(n-1)\cdots(n-k+1) = \frac{n(n-1)\cdots(n-k+1)(n-k)(n-k-1)\cdots(2)(1)}{(n-k)(n-k-1)\cdots(2)(1)}$$


which becomes

$${}_nP_k = \frac{n!}{(n-k)!}$$

For example, ${}_9P_3 = 9!/(9-3)! = 9!/6! = 9 \cdot 8 \cdot 7 \cdot 6!/6! = 9 \cdot 8 \cdot 7$. Note also that because 0! = 1,
${}_nP_n = n!/(n-n)! = n!/0! = n!/1 = n!$, as it should.


Combinations 

Often the objective is to count the number of unordered subsets of size k that can be formed from a 
set consisting of n distinct objects. For example, in bridge it is only the 13 cards in a hand and not the 
order in which they are dealt that is important; in the formation of a committee, the order in which 
committee members are listed is frequently unimportant. 


DEFINITION Given a set of n distinct objects, any unordered subset of size k of the objects is
called a combination. The number of combinations of size k that can be formed
from n distinct objects will be denoted by $\binom{n}{k}$ or ${}_nC_k$.


The number of combinations of size k from a particular set is smaller than the number of permutations 
because, when order is disregarded, some of the permutations correspond to the same combination. 
Consider, for example, the set {A, B, C, D, E} consisting of five elements. There are
${}_5P_3 = 5!/(5-3)! = 60$ permutations of size 3. There are six permutations of size 3 consisting of the
elements A, B, and C because these three can be ordered 3 · 2 · 1 = 3! = 6 ways: (A, B, C), (A, C, B),
(B, A, C), (B, C, A), (C, A, B), and (C, B, A). These six permutations are equivalent to the single
combination {A, B, C}. Similarly, for any other combination of size 3, there are 3! permutations, each
obtained by ordering the three objects. Thus,


$$60 = {}_5P_3 = \binom{5}{3} \cdot 3! \quad \text{so} \quad \binom{5}{3} = \frac{60}{3!} = 10$$


These ten combinations are 


{A,B,C} {A,B,D} {A,B,E} {A,C,D} {A,C,E} 
{A,D,E} {B,C,D} {B,C,E} {B,D,E} {C,D,E} 


When there are n distinct objects, any permutation of size k is obtained by ordering the k unordered
objects of a combination in one of k! ways, so the number of permutations is the product of k! and the
number of combinations. This gives

$$\binom{n}{k} = \frac{{}_nP_k}{k!} = \frac{n!}{k!(n-k)!}$$

Notice that $\binom{n}{n} = 1$ and $\binom{n}{0} = 1$ because there is only one way to choose a set of (all) n elements or
of no elements, and $\binom{n}{1} = n$ since there are n subsets of size 1.
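
The relationship between permutations and combinations can be checked on the five-element set used above. This is a small illustrative sketch using Python's `itertools` and `math` modules; the variable names are not from the text.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

items = "ABCDE"

# Unordered subsets of size 3 and ordered sequences of size 3.
subsets = ["".join(c) for c in combinations(items, 3)]
orderings = list(permutations(items, 3))

print(subsets)
print(len(orderings), len(subsets))  # 60 10

# Each combination of size k corresponds to k! permutations.
identity_holds = perm(5, 3) == comb(5, 3) * factorial(3)
print(identity_holds)  # True
```

The list printed for `subsets` contains the same ten combinations displayed earlier, from ABC through CDE.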


Example 2.23 A bridge hand consists of any 13 cards selected from a 52-card deck without regard
to order. There are $\binom{52}{13} = 52!/(13! \cdot 39!)$ different bridge hands, which works out to approximately
635 billion. Since there are 13 cards in each suit, the number of hands consisting entirely of clubs
and/or spades (no red cards) is $\binom{26}{13} = 26!/(13! \cdot 13!) = 10{,}400{,}600$. One of these $\binom{26}{13}$ hands
consists entirely of spades, and one consists entirely of clubs, so there are $\binom{26}{13} - 2$ hands that
consist entirely of clubs and spades with both suits represented in the hand. Suppose a bridge hand is
dealt from a well-shuffled deck (i.e., 13 cards are randomly selected from among the 52 possibilities) 
and let 


A = {the hand consists entirely of spades and clubs with both suits represented } 
B = {the hand consists of exactly two suits} 


The $N = \binom{52}{13}$ possible outcomes are equally likely, so

$$P(A) = \frac{N(A)}{N} = \frac{\binom{26}{13} - 2}{\binom{52}{13}} = .0000164$$

Since there are $\binom{4}{2} = 6$ combinations consisting of two suits, of which spades and clubs is one such
combination,

$$P(B) = \frac{N(B)}{N} = \frac{6\left[\binom{26}{13} - 2\right]}{\binom{52}{13}} = .0000983$$

That is, a hand consisting entirely of cards from exactly two of the four suits will occur roughly once
in every 10,000 hands. If you play bridge only once a month, it is likely that you will never be dealt
such a hand. ■
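
The bridge-hand probabilities of Example 2.23 involve numbers far too large to enumerate, but they are simple to compute with `math.comb`. The following is a sketch; the variable names are illustrative.

```python
from math import comb

n_hands = comb(52, 13)           # all possible bridge hands
n_black = comb(26, 13)           # hands using only clubs and/or spades
n_a = n_black - 2                # exclude the all-spade and all-club hands
n_b = comb(4, 2) * n_a           # any one of the 6 possible pairs of suits

p_a = n_a / n_hands
p_b = n_b / n_hands

print(n_hands)        # 635013559600
print(n_black)        # 10400600
print(f"{p_a:.7f}")   # 0.0000164
print(f"{p_b:.7f}")   # 0.0000983
```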


Example 2.24 A university has received a shipment of 25 new laptops for staff and faculty, of 
which 10 have AMD processors and 15 have Intel chips. If 6 of these 25 laptops are selected at 
random to be checked by a technician, what is the probability that exactly 3 of those selected have 
Intel processors (so that the other 3 are AMD)? 

Let $D_3$ = {exactly 3 of the 6 selected have Intel processors}. Assuming that any particular set of
6 laptops is as likely to be chosen as is any other set of 6, we have equally likely outcomes, so
$P(D_3) = N(D_3)/N$, where N is the number of ways of choosing 6 laptops from the 25 and $N(D_3)$ is the number of ways
of choosing 3 with AMD processors and 3 with Intel chips. Thus $N = \binom{25}{6}$. To obtain $N(D_3)$, think of first
choosing 3 of the 15 Intel laptops and then 3 of the AMD laptops. There are $\binom{15}{3}$ ways of choosing the 3
with Intel processors, and there are $\binom{10}{3}$ ways of choosing the 3 with AMD processors; by the
Fundamental Counting Principle, $N(D_3)$ is the product of these two numbers. So,

$$P(D_3) = \frac{N(D_3)}{N} = \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}} = \frac{\dfrac{15!}{3!\,12!} \cdot \dfrac{10!}{3!\,7!}}{\dfrac{25!}{6!\,19!}} = .3083$$

Next, let $D_4$ = {exactly 4 of the 6 laptops selected have Intel processors} and define $D_5$ and $D_6$ in an
analogous manner. Notice that the events $D_3$, $D_4$, $D_5$, and $D_6$ are disjoint. Thus, the probability that at
least 3 laptops with Intel processors are selected is

$$P(D_3 \cup D_4 \cup D_5 \cup D_6) = P(D_3) + P(D_4) + P(D_5) + P(D_6)$$

$$= \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}} + \frac{\binom{15}{4}\binom{10}{2}}{\binom{25}{6}} + \frac{\binom{15}{5}\binom{10}{1}}{\binom{25}{6}} + \frac{\binom{15}{6}\binom{10}{0}}{\binom{25}{6}} = .8530$$ ■
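
The calculation in Example 2.24 can be sketched in a few lines of Python. The helper `prob_exactly` is a hypothetical name; it applies the Fundamental Counting Principle to the numerator exactly as in the example.

```python
from math import comb

N_INTEL, N_AMD, N_CHECKED = 15, 10, 6
total_ways = comb(N_INTEL + N_AMD, N_CHECKED)

def prob_exactly(x):
    """P(exactly x of the 6 checked laptops have Intel processors)."""
    return comb(N_INTEL, x) * comb(N_AMD, N_CHECKED - x) / total_ways

p_exactly_3 = prob_exactly(3)

# The events D3, D4, D5, D6 are disjoint, so their probabilities add.
p_at_least_3 = sum(prob_exactly(x) for x in range(3, N_CHECKED + 1))

print(round(p_exactly_3, 4))   # 0.3083
print(round(p_at_least_3, 4))  # 0.853
```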


Exercises: Section 2.3 (31-48) 


31. The College of Science Student Council
has one representative from each of the five
science departments (biology, chemistry,
statistics, mathematics, physics). In how
many ways can

a. Both a council president and a vice
president be selected?

b. A president, a vice president, and a
secretary be selected?

c. Two members be selected for the
Dean’s Council?


32. A friend is giving a dinner party. Her current
wine supply includes 8 bottles of zinfandel,
10 of merlot, and 12 of cabernet
(she drinks only red wine), all from different
wineries.

a. If she wants to serve 3 bottles of
zinfandel and serving order is important,
how many ways are there to do this?

b. If 6 bottles of wine are to be randomly
selected from the 30 for serving, how
many ways are there to do this?

c. If 6 bottles are randomly selected, how
many ways are there to obtain two
bottles of each variety?

d. If 6 bottles are randomly selected, what
is the probability that this results in two
bottles of each variety being chosen?

e. If 6 bottles are randomly selected, what
is the probability that all of them are the
same variety?


33. a. Beethoven wrote 9 symphonies and
Mozart wrote 27 piano concertos. If a
university radio station announcer
wishes to play first a Beethoven symphony
and then a Mozart concerto, in
how many ways can this be done?


b. The station manager decides that on 
each successive night (7 days per 
week), a Beethoven symphony will be 
played, followed by a Mozart piano 
concerto, followed by a Schubert string 
quartet (of which there are 15). For 
roughly how many years could this 
policy be continued before exactly the 
same program would have to be 
repeated? 


34. A chain of home electronics stores is
offering a special price on a complete set of
components (receiver, CD/MP3 player,
speakers). A purchaser is offered a choice
of manufacturer for each component:


Receiver        Kenwood, Onkyo, Pioneer, Sony, Yamaha
CD/MP3 player   Onkyo, Pioneer, Sony, Panasonic
Speakers        Boston, Infinity, Polk


A switchboard display in the store allows a 
customer to connect any selection of com- 
ponents (consisting of one of each type). 
Use the product rules to answer the fol- 
lowing questions: 


a. In how many ways can one component 
of each type be selected? 

b. In how many ways can components be 
selected if both the receiver and the 
CD/MP3 player are to be Sony? 

c. In how many ways can components be
selected if none is to be Sony?

d. In how many ways can a selection be
made if at least one Sony component is
to be included?

e. If someone flips switches on the selection
in a completely random fashion,
what is the probability that the system
selected contains at least one Sony
component? Exactly one Sony
component?


35. A particular iPod playlist contains 100
songs, of which 10 are by the Beatles.
Suppose the shuffle feature is used to play
the songs in random order (the randomness
of the shuffling process is investigated in
“Does Your iPod Really Play Favorites?”
(The Amer. Statistician 2009: 263-268)).
What is the probability that the first Beatles
song heard is the fifth song played?


36. A local bar stocks 12 American beers, 8
Mexican beers, and 9 German beers. You
ask the bartender to pick out a five-beer
“sampler” for you. Assume the bartender
makes the five selections at random and
without replacement.

a. What is the probability you get at least
four American beers?

b. What is the probability you get five
beers from the same country?


37. The statistics department at the authors’
university participates in an annual volleyball
tournament. Suppose that all 16
department members are willing to play.


a. How many different six-person volley- 
ball rosters could be generated? (That is, 
how many years could the department 
participate in the tournament without 
repeating the same six-person team?) 

b. The statistics department faculty consists
of 5 women and 11 men. How many
rosters comprised of exactly 2 women
and 4 men can be generated?

c. The tournament’s rules actually require 
that each team includes at least two 
women. Under this rule, how many 
valid teams could be generated? 


d. Suppose this year the department decides
to randomly select its six players.
What is the probability the randomly
selected team has exactly two women?
At least two women?


38. A production facility employs 20 workers
on the day shift, 15 workers on the swing
shift, and 10 workers on the graveyard
shift. A quality control consultant is to
select 6 of these workers for in-depth
interviews. Suppose the selection is made
in such a way that any particular group of 6
workers has the same chance of being
selected as does any other group (drawing 6
slips without replacement from among 45).

a. How many selections result in all 6
workers coming from the day shift? What
is the probability that all 6 selected
workers will be from the day shift?

b. What is the probability that all 6 selected
workers will be from the same shift?

c. What is the probability that at least two
different shifts will be represented
among the selected workers?

d. What is the probability that at least one
of the shifts will be unrepresented in the
sample of workers?


39. An academic department with five faculty
members narrowed its choice for department
head to either candidate A or candidate
B. Each member then voted on a slip
of paper for one of the candidates. Suppose
there are actually three votes for A and two
for B. If the slips are selected for tallying in
random order, what is the probability that
A remains ahead of B throughout the vote
count (for example, this event occurs if the
selected ordering is AABAB, but not for
ABBAA)?


40. An experimenter is studying the effects of
temperature, pressure, and type of catalyst
on yield from a chemical reaction. Three
different temperatures, four different pressures,
and five different catalysts are under
consideration.

a. If any particular experimental run
involves the use of a single temperature,
pressure, and catalyst, how many
experimental runs are possible?

b. How many experimental runs involve
use of the lowest temperature and two
lowest pressures?


41. Refer to the previous exercise and suppose
that five different experimental runs are to
be made on the first day of experimentation.
If the five are randomly selected from
among all the possibilities, so that any
group of five has the same probability of
selection, what is the probability that a
different catalyst is used on each run?


42. A box in a certain supply room contains
four 40-W lightbulbs, five 60-W bulbs, and
six 75-W bulbs. Suppose that three bulbs
are randomly selected.

a. What is the probability that exactly two
of the selected bulbs are rated 75 W?

b. What is the probability that all three of
the selected bulbs have the same rating?

c. What is the probability that one bulb of
each type is selected?

d. Suppose now that bulbs are to be
selected one by one until a 75-W bulb is
found. What is the probability that it is
necessary to examine at least six bulbs?


43. Fifteen telephones have just been received
at an authorized service center. Five of
these telephones are cellular, five are
cordless, and the other five are corded
phones. Suppose that these components are
randomly allocated the numbers 1, 2, ..., 15
to establish the order in which they will be
serviced.


a. What is the probability that all the 
cordless phones are among the first ten 
to be serviced? 

b. What is the probability that after servicing
ten of these phones, phones of
only two of the three types remain to be
serviced?

c. What is the probability that two phones
of each type are among the first six
serviced?


44. Three molecules of type A, three of type B,
three of type C, and three of type D are
to be linked together to form a chain
molecule. One such chain molecule is
ABCDABCDABCD, and another is
BCDDAAABDBCC.

a. How many such chain molecules are
there? [Hint: If the three A’s were
distinguishable from one another ($A_1$, $A_2$,
$A_3$) and the B’s, C’s, and D’s were
also, how many molecules would there
be? How is this number reduced when
the subscripts are removed from the
A’s?]

b. Suppose a chain molecule of the type
described is randomly selected. What is
the probability that all three molecules
of each type end up next to each other
(such as in BBBAAADDDCCC)?


45. Three married couples have purchased
theater tickets and are seated in a row
consisting of just six seats. If they take their
seats in a completely random fashion (random
order), what is the probability that Jim
and Paula (husband and wife) sit in the two
seats on the far left? What is the probability
that Jim and Paula end up sitting next to
one another? What is the probability that at
least one of the wives ends up sitting next
to her husband?


46. A popular Dilbert cartoon strip (popular
among statisticians, anyway) shows an
allegedly “random” number generator
producing the sequence 999999 with the
accompanying comment, “That’s the problem
with randomness: you can never be
sure.” Most people would agree that
999999 seems less “random” than, say,
703928, but in what sense is that true?
Imagine we randomly generate a six-digit
number; i.e., we make six draws with
replacement from the digits 0 through 9.

a. What is the probability of generating
999999?

b. What is the probability of generating
703928?

c. What is the probability of generating a
sequence of six identical digits?

d. What is the probability of generating a
sequence with no identical digits?
(Comparing the answers to (c) and
(d) gives some sense of why some
sequences feel intuitively more random
than others.)

e. Here’s a real challenge: what is the
probability of generating a sequence
with exactly one repeated digit?


47. Show that $\binom{n}{k} = \binom{n}{n-k}$. Give an
interpretation involving subsets.


48. Consider a group of 10 children.

a. How many ways can the children be
split into groups of sizes 2, 3, and 5?
[Hint: First select 2 children from the
original 10, then 3 from the remaining
8. Apply the Fundamental Counting
Principle.]

b. Verify that your answer to (a) is
equivalent to $\dfrac{10!}{2!\,3!\,5!}$.

c. Generalize the previous result by
showing that the number of ways to
partition n objects into groups of sizes
$k_1, \ldots, k_r$ (with $k_1 + \cdots + k_r = n$) is
equal to $\dfrac{n!}{k_1! \cdots k_r!}$.


2.4 Conditional Probability 


The probabilities assigned to various events depend on what is known about the experimental 
situation when the assignment is made. Subsequent to the initial assignment, partial information 
relevant to the outcome of the experiment may become available. Such information may cause us to 
revise some of our probability assignments. For a particular event A, we have used P(A) to represent 
the probability assigned to A; we now think of P(A) as the original or “unconditional” probability of 
the event A. 

In this section, we examine how the information “an event B has occurred” affects the probability 
assigned to A. For example, A might refer to an individual having a particular disease in the presence 
of certain symptoms. If a blood test is performed on the individual and the result is negative 
(B = negative blood test), then the probability of having the disease will change—it should decrease, 
but not usually to zero, since blood tests are not infallible. 


Example 2.25 Complex components are assembled in a plant that uses two different assembly lines,
A and A′. Line A uses older equipment than A′, so it is somewhat slower and less reliable. Suppose on
a given day line A has assembled 8 components, of which 2 have been identified as defective (B) and
6 as nondefective (B′), whereas A′ has produced 1 defective and 9 nondefective components. This
information is summarized in the accompanying table.


            Condition
Line        B       B′
A           2       6
A′          1       9


Unaware of this information, the sales manager randomly selects 1 of these 18 components for a
demonstration. Prior to the demonstration,


$$P(\text{line } A \text{ component selected}) = P(A) = \frac{N(A)}{N} = \frac{8}{18} \approx .44$$


However, if the chosen component turns out to be defective, then the event B has occurred, so the
component must have been 1 of the 3 in the B column of the table. Since these 3 components are
equally likely among themselves, the probability the component was selected from line A, given that
event B has occurred, is


$$P(A, \text{ given } B) = \frac{2}{3} = \frac{2/18}{3/18} = \frac{P(A \cap B)}{P(B)} \tag{2.2}$$

■
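
The reasoning of Example 2.25 can be mirrored in code by storing the table of counts and computing each probability as a ratio of counts. This is a minimal sketch; the dictionary layout and the helper `prob` are illustrative assumptions, and `Fraction` is used so the results stay exact.

```python
from fractions import Fraction

# Component counts from the table in Example 2.25: (line, condition) -> count.
counts = {("A", "B"): 2, ("A", "B'"): 6, ("A'", "B"): 1, ("A'", "B'"): 9}
total = sum(counts.values())

def prob(event):
    """Probability of the set of (line, condition) cells satisfying `event`."""
    return Fraction(sum(n for cell, n in counts.items() if event(cell)), total)

p_a = prob(lambda cell: cell[0] == "A")
p_b = prob(lambda cell: cell[1] == "B")
p_a_and_b = prob(lambda cell: cell == ("A", "B"))

# Ratio of unconditional probabilities, as in Equation (2.2).
p_a_given_b = p_a_and_b / p_b

print(p_a, p_b, p_a_given_b)  # 4/9 1/6 2/3
```

Learning that the component is defective raises the probability that it came from line A from 4/9 to 2/3.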


In Equation (2.2), the conditional probability is expressed as a ratio of unconditional probabilities. 
The numerator is the probability of the intersection of the two events, whereas the denominator is the 
probability of the conditioning event B. A Venn diagram illuminates this relationship (Figure 2.9). 


Figure 2.9 Motivating the definition of conditional probability (Venn diagram: “conditioning” on event B makes B the new “sample space,” and A ∩ B is what remains of event A)


Given that B has occurred, the relevant sample space is no longer $\mathcal{S}$ but consists of just outcomes in
B, and A has occurred if and only if one of the outcomes in the intersection A ∩ B occurred. So the
conditional probability of A given B should, logically, be the ratio of the likelihoods of these two 
events. 


The Definition of Conditional Probability 

Example 2.25 demonstrates that when outcomes are equally likely, computation of conditional 
probabilities can be based on intuition. When experiments are more complicated, though intuition 
may fail us, we want to have a general definition of conditional probability that will yield intuitive 
answers in simple problems. Figure 2.9 and Equation (2.2) suggest the appropriate definition. 


DEFINITION For any two events A and B with P(B) > 0, the conditional probability of 
A given that B has occurred, denoted P(A|B), is defined by 


P(ANB) 


P(A|B) = PB) 


(2.3) 
Example 2.26 Suppose that of all individuals buying a new iPhone, 60% include a heavy-duty
phone case in their purchase, 40% include a portable battery, and 30% include both a heavy-duty
case and a portable battery. Consider randomly selecting an iPhone buyer and let
A = {heavy-duty case purchased} and B = {portable battery purchased}. Then P(A) = .60, P(B) = .40, and
P(both purchased) = P(A ∩ B) = .30. Given that the selected individual purchased a portable battery,
the probability that a heavy-duty case was also purchased is

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{.30}{.40} = .75 \]

That is, of all those purchasing a portable battery, 75% purchased a heavy-duty phone case. Similarly,

\[ P(\text{battery} \mid \text{case}) = P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{.30}{.60} = .50 \]

Notice that P(A|B) ≠ P(A) and P(B|A) ≠ P(B). Notice also that P(A|B) ≠ P(B|A): these represent two
different probabilities computed using different pieces of “given” information. ■
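
The arithmetic of Example 2.26 can be checked with a few lines of Python. This is only a minimal sketch of Equation (2.3); the function name `conditional` and the variable names are our own illustrative choices, not notation from the text.

```python
def conditional(p_joint, p_given):
    """Return P(A|B) = P(A ∩ B) / P(B); the definition requires P(B) > 0."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given

# Values from Example 2.26: A = heavy-duty case, B = portable battery
p_case, p_battery, p_both = 0.60, 0.40, 0.30

p_case_given_battery = conditional(p_both, p_battery)   # P(A|B)
p_battery_given_case = conditional(p_both, p_case)      # P(B|A)

print(round(p_case_given_battery, 2), round(p_battery_given_case, 2))
```

The same joint probability is divided by a different denominator in each call, which is why P(A|B) and P(B|A) generally differ.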


Example 2.27 A culture website includes three sections entitled “Art” (A), “Books” (B), and 
“Cinema” (C). Reading habits of a randomly selected reader with respect to these sections are 


Read regularly   A     B     C     A ∩ B   A ∩ C   B ∩ C   A ∩ B ∩ C
Probability      .14   .23   .37   .08     .09     .13     .05


Figure 2.10 encapsulates this information. 


[Figure 2.10: Venn diagram of events A, B, and C showing the probability of each region]

Figure 2.10 Venn diagram for Example 2.27


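We thus have

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{.08}{.23} = .348 \]

\[ P(A \mid B \cup C) = \frac{P(A \cap (B \cup C))}{P(B \cup C)} = \frac{.04 + .05 + .03}{.47} = \frac{.12}{.47} = .255 \]

\[ P(A \mid \text{reads at least one}) = P(A \mid A \cup B \cup C) = \frac{P(A \cap (A \cup B \cup C))}{P(A \cup B \cup C)} = \frac{P(A)}{P(A \cup B \cup C)} = \frac{.14}{.49} = .286 \]

and

\[ P(A \cup B \mid C) = \frac{P((A \cup B) \cap C)}{P(C)} = \frac{.04 + .05 + .08}{.37} = \frac{.17}{.37} = .459 \]

■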
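
The region probabilities used in Example 2.27 can be recovered from the table by subtraction, and the four conditional probabilities then follow from the definition. The sketch below is one way to organize that bookkeeping in Python; the dictionary `atoms` and the helper names `prob` and `cond` are our own, hypothetical names.

```python
# Given probabilities from the table in Example 2.27
p = {"A": .14, "B": .23, "C": .37, "AB": .08, "AC": .09, "BC": .13, "ABC": .05}

# Probabilities of the eight disjoint Venn regions, keyed by (in A, in B, in C)
atoms = {
    (1, 1, 1): p["ABC"],
    (1, 1, 0): p["AB"] - p["ABC"],
    (1, 0, 1): p["AC"] - p["ABC"],
    (0, 1, 1): p["BC"] - p["ABC"],
}
atoms[(1, 0, 0)] = p["A"] - atoms[(1, 1, 0)] - atoms[(1, 0, 1)] - atoms[(1, 1, 1)]
atoms[(0, 1, 0)] = p["B"] - atoms[(1, 1, 0)] - atoms[(0, 1, 1)] - atoms[(1, 1, 1)]
atoms[(0, 0, 1)] = p["C"] - atoms[(1, 0, 1)] - atoms[(0, 1, 1)] - atoms[(1, 1, 1)]
atoms[(0, 0, 0)] = 1 - sum(atoms.values())   # reads none of the three sections

def prob(event):
    """P(event), where event is a predicate on the (a, b, c) membership flags."""
    return sum(pr for region, pr in atoms.items() if event(*region))

def cond(event, given):
    """P(event | given) computed from the definition of conditional probability."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

p_a_given_b = cond(lambda a, b, c: a, lambda a, b, c: b)
p_a_given_b_or_c = cond(lambda a, b, c: a, lambda a, b, c: b or c)
p_a_given_any = cond(lambda a, b, c: a, lambda a, b, c: a or b or c)
p_a_or_b_given_c = cond(lambda a, b, c: a or b, lambda a, b, c: c)

for value in (p_a_given_b, p_a_given_b_or_c, p_a_given_any, p_a_or_b_given_c):
    print(round(value, 3))
```

Working with disjoint regions means every event probability is a plain sum, so no inclusion-exclusion formula is needed once the regions are in hand.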


The Multiplication Rule for P(A ∩ B)
The definition of conditional probability yields the following result, obtained by multiplying both
sides of Equation (2.3) by P(B).


MULTIPLICATION RULE P(A ∩ B) = P(A|B) · P(B)


This rule is important because it is often the case that P(A ∩ B) is desired, whereas both P(B) and
P(A|B) can be specified from the problem description. By reversing the roles of A and B, the
Multiplication Rule can also be written as P(A ∩ B) = P(B|A) · P(A).


Example 2.28 Four individuals have responded to a request by a blood bank for blood donations. 
None of them has donated before, so their blood types are unknown. Suppose only type O+ is desired 
and only one of the four actually has this type. If the potential donors are selected in random order for 
typing, what is the probability that at least three individuals must be typed to obtain the desired type? 

Define B = {first type not O+} and A = {second type not O+}. Since three of the four potential
donors are not O+, P(B) = 3/4. Given that the first person typed is not O+, two of the three
individuals left are not O+, and so P(A|B) = 2/3. The Multiplication Rule now gives

\[ P(\text{at least three individuals are typed}) = P(\text{first two typed are not O+}) = P(A \cap B) = P(A \mid B) \cdot P(B) = \frac{2}{3} \cdot \frac{3}{4} = \frac{6}{12} = .5 \]

■


The Multiplication Rule is most useful when the experiment consists of several stages in succession. 
The conditioning event B then describes the outcome of the first stage and A the outcome of the 
second, so that P(A|B)—conditioning on what occurs first—will often be known. The rule is easily 
extended to experiments involving more than two stages. For example, 


\[ P(A_1 \cap A_2 \cap A_3) = P(A_3 \mid A_1 \cap A_2) \cdot P(A_1 \cap A_2) = P(A_3 \mid A_1 \cap A_2) \cdot P(A_2 \mid A_1) \cdot P(A_1) \qquad (2.4) \]

where A1 occurs first, followed by A2, and finally A3.


Example 2.29 Using Equation (2.4) for the blood typing experiment of Example 2.28,

P(third type is O+)
= P(third is | first isn’t ∩ second isn’t) · P(second isn’t | first isn’t) · P(first isn’t)

\[ = \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{3}{4} = \frac{1}{4} = .25 \]

■
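
Because the blood-typing experiment has only 24 equally likely typing orders, the chained Multiplication Rule can be cross-checked by brute-force enumeration. The Python sketch below does both calculations with exact fractions; labeling the O+ donor as donor 0 is purely our own convention for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Chain of conditional probabilities, as in Equation (2.4)
p_first_not = Fraction(3, 4)     # P(first isn't O+)
p_second_not = Fraction(2, 3)    # P(second isn't | first isn't)
p_third_is = Fraction(1, 2)      # P(third is | first isn't and second isn't)
p_chain = p_third_is * p_second_not * p_first_not

# Brute force: every ordering of the four donors is equally likely.
# Donor 0 is (by our labeling) the single donor with type O+.
orders = list(permutations(range(4)))
p_third_enum = Fraction(sum(1 for o in orders if o[2] == 0), len(orders))
p_at_least_three = Fraction(
    sum(1 for o in orders if o[0] != 0 and o[1] != 0), len(orders)
)

print(p_chain, p_third_enum, p_at_least_three)
```

The enumeration reproduces both the .25 of Example 2.29 and the probability in Example 2.28 that at least three individuals must be typed.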


When the experiment of interest consists of a sequence of several stages, it is convenient to represent 
these with a tree diagram. Once we have an appropriate tree diagram, probabilities and conditional 
probabilities can be entered on the various branches; this will make repeated use of the Multiplication 
Rule quite straightforward. 


Example 2.30 An online retailer sells three different brands of Bluetooth earbuds. Of its earbud
sales, 50% are brand 1 (the least expensive), 30% are brand 2, and 20% are brand 3. Each
manufacturer offers a 1-year warranty. It is known that 25% of brand 1’s earbuds will be returned within
the 1-year warranty period, whereas the corresponding percentages for brands 2 and 3 are 20% and
10%, respectively.


1. What is the probability that a randomly selected purchaser has bought brand 1 earbuds that will be 
returned while under warranty? 

2. What is the probability that a randomly selected purchaser has earbuds that will be returned while 
under warranty? 

3. If a customer returns earbuds under warranty, what is the probability that they are brand 1 
earbuds? Brand 2? Brand 3? 


The first stage of the problem involves a customer selecting one of the three brands of earbud. Let
Ai = {brand i is purchased}, for i = 1, 2, and 3. Then P(A1) = .50, P(A2) = .30, and P(A3) = .20.
Once a brand of earbud is selected, the second stage involves observing whether the selected earbuds
get returned during the warranty period. With B = {returned} and B′ = {not returned}, the given
information implies that P(B|A1) = .25, P(B|A2) = .20, and P(B|A3) = .10.

The tree diagram representing this experimental situation appears in Figure 2.11. The initial
branches correspond to different brands of earbuds; there are two second-generation branches
emanating from the tip of each initial branch, one for “returned” and the other for “not returned.” The
probability P(Ai) appears on the ith initial branch, whereas the conditional probabilities P(B|Ai) and
P(B′|Ai) appear on the second-generation branches. To the right of each second-generation branch
corresponding to the occurrence of B, we display the product of probabilities on the branches leading
out to that point. This is simply the Multiplication Rule in action. The answer to question 1 is thus
P(A1 ∩ B) = P(B|A1) · P(A1) = .125. The answer to question 2 is


P(B) = P[(brand 1 and returned) or (brand 2 and returned) or (brand 3 and returned)]
     = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B)
     = .125 + .060 + .020 = .205


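Finally,

\[ P(A_1 \mid B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{.125}{.205} = .61 \]

\[ P(A_2 \mid B) = \frac{P(A_2 \cap B)}{P(B)} = \frac{.060}{.205} = .29 \]

and

\[ P(A_3 \mid B) = 1 - P(A_1 \mid B) - P(A_2 \mid B) = .10 \]

[Figure 2.11: tree diagram. Initial branches A1, A2, A3; second-generation branches B and B′. Products shown beside the B branches: P(B|A1) · P(A1) = P(B ∩ A1) = .125; P(B|A2) · P(A2) = P(B ∩ A2) = .060; P(B|A3) · P(A3) = P(B ∩ A3) = .020; P(B) = .205]

Figure 2.11 Tree diagram for Example 2.30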


Notice that the initial or prior probability of brand 1 is .50, whereas once it is known that the
selected earbuds were returned, the posterior probability of brand 1 increases to .61. This is because
brand 1 earbuds are more likely to be returned under warranty than are the other brands. In contrast,
the posterior probability of brand 3 is P(A3|B) = .10, which is much less than the prior probability
P(A3) = .20. ■
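
The tree-diagram calculation in Example 2.30 is mechanical enough to express in a short Python function: multiply along each branch, add the products, and divide. This is a sketch only; the function name `posteriors` and its argument names are our own assumptions rather than anything defined in the text.

```python
def posteriors(priors, likelihoods):
    """Return (P(B), [P(A_i | B)]) from priors P(A_i) and likelihoods P(B | A_i)."""
    # Multiplication Rule along each branch of the tree: P(A_i ∩ B)
    joints = [prior * like for prior, like in zip(priors, likelihoods)]
    total = sum(joints)                       # P(B), the sum over all branches
    return total, [joint / total for joint in joints]

priors = [0.50, 0.30, 0.20]        # P(A1), P(A2), P(A3): brand shares
likelihoods = [0.25, 0.20, 0.10]   # P(B|A1), P(B|A2), P(B|A3): return rates

p_returned, post = posteriors(priors, likelihoods)
print(round(p_returned, 3), [round(x, 2) for x in post])
```

The posterior probabilities necessarily sum to 1, which is the fact used in the example to obtain P(A3|B) by subtraction.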


The Law of Total Probability and Bayes’ Theorem 

The computation of a posterior probability P(Aj|B) from given prior probabilities P(Ai) and
conditional probabilities P(B|Ai) occupies a central position in elementary probability. The general rule for
such computations, which is really just a simple application of the Multiplication Rule, goes back to
the Reverend Thomas Bayes, who lived in the eighteenth century. To state it we first need another
result. Recall that events A1, ..., Ak are mutually exclusive if no two have any common outcomes. The
events are exhaustive if A1 ∪ ··· ∪ Ak = S, so that one Ai must occur.


LAW OF TOTAL PROBABILITY Let A1, ..., Ak be mutually exclusive and exhaustive events. Then for any
other event B,

\[ P(B) = P(B \mid A_1) P(A_1) + \cdots + P(B \mid A_k) P(A_k) = \sum_{i=1}^{k} P(B \mid A_i) P(A_i) \qquad (2.5) \]


Proof Because the Ai’s are mutually exclusive and exhaustive, if B occurs it must be in conjunction with
exactly one of the Ai’s. That is, B = (A1 and B) or ... or (Ak and B) = (A1 ∩ B) ∪ ··· ∪ (Ak ∩ B), where
the events (Ai ∩ B) are mutually exclusive. This “partitioning of B” is illustrated in Figure 2.12. Thus

\[ P(B) = \sum_{i=1}^{k} P(A_i \cap B) = \sum_{i=1}^{k} P(B \mid A_i) P(A_i) \]

as desired. ■


Figure 2.12 Partition of B by mutually exclusive and exhaustive Ai’s


An example of the use of Equation (2.5) appeared in answering question 2 of Example 2.30, where
A1 = {brand 1}, A2 = {brand 2}, A3 = {brand 3}, and B = {returned}.


Example 2.31 A certain university has three colleges: Letters & Science (45% of the student body), 
Business (32%), and Engineering (23%). Of the students in the College of Letters & Science, 11% 
traveled out of state during the most recent spring break, compared to 14% in Business and just 3% in 
Engineering. If we select a student completely at random from this student body, what’s the prob- 
ability he/she traveled out of state for spring break? 

Define A1 = {the student belongs to Letters & Science}; define A2 and A3 similarly for Business
and Engineering, respectively. Let B = {the student traveled out of state for spring break}. The
percentages provided above imply that

P(A1) = .45      P(A2) = .32      P(A3) = .23
P(B|A1) = .11    P(B|A2) = .14    P(B|A3) = .03

Notice that A1, A2, A3 form a partition of the sample space (the student body). Apply the Law of Total
Probability:

\[ P(B) = (.11)(.45) + (.14)(.32) + (.03)(.23) = .1012 \]

■
BAYES’ THEOREM Let A1, ..., Ak be a collection of mutually exclusive and exhaustive events
with P(Ai) > 0 for i = 1, ..., k. Then for any other event B for which
P(B) > 0,

\[ P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)} = \frac{P(B \mid A_j) P(A_j)}{\sum_{i=1}^{k} P(B \mid A_i) P(A_i)} \qquad j = 1, \ldots, k \qquad (2.6) \]


The transition from the second to the third expression in (2.6) rests on using the Multiplication Rule 
in the numerator and the Law of Total Probability in the denominator. 

The proliferation of events and subscripts in (2.6) can be a bit intimidating to probability
newcomers. When k = 2, so that the partition of S consists of just A1 = A and A2 = A′, Bayes’
Theorem becomes

\[ P(A \mid B) = \frac{P(A) P(B \mid A)}{P(A) P(B \mid A) + P(A') P(B \mid A')} \]


As long as there are relatively few events in the partition, a tree diagram (as in Example 2.30) can be 
used as a basis for calculating posterior probabilities without ever referring explicitly to Bayes’ 
theorem. 


Example 2.32 Incidence of a rare disease. Only 1 in 1000 adults is afflicted with a rare disease for 
which a diagnostic test has been developed. The test is such that when an individual actually has the 
disease, a positive result will occur 99% of the time, whereas an individual without the disease will 
show a positive test result only 2% of the time. If a randomly selected individual is tested and the 
result is positive, what is the probability that the individual has the disease? 

[Note: The sensitivity of this test is 99%, whereas the specificity—how specific positive results are 
to this disease—is 98%. As an indication of the accuracy of medical tests, an article in the October 29, 
2010, New York Times reported that the sensitivity and specificity for a new DNA test for colon 
cancer were 86% and 93%, respectively. The PSA test for prostate cancer has sensitivity 85% and 
specificity about 30%, while the mammogram for breast cancer has sensitivity 75% and specificity 
92%. And then there are COVID-19 tests. All tests are less than perfect.]


To use Bayes’ theorem, let A1 = {individual has the disease}, A2 = {individual does not have
the disease}, and B = {positive test result}. Then P(A1) = .001, P(A2) = .999, P(B|A1) = .99, and
P(B|A2) = .02. The tree diagram for this problem is in Figure 2.13.

Next to each branch corresponding to a positive test result, the Multiplication Rule yields the 
recorded probabilities. Therefore, P(B) = .00099 + .01998 = .02097, from which we have 


\[ P(A_1 \mid B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{.00099}{.02097} = .047 \]


This result seems counterintuitive; because the diagnostic test appears so accurate, we expect 
someone with a positive test result to be highly likely to have the disease, whereas the computed 
conditional probability is only .047. However, because the disease is rare and the test isn’t perfectly 
reliable, most positive test results arise from errors rather than from diseased individuals. The 
probability of having the disease has increased by a multiplicative factor of 47 (from prior .001 to 
posterior .047); but to get a further increase in the posterior probability, a diagnostic test with much
smaller error rates is needed. If the disease were not so rare (e.g., 25% incidence in the population),
then the error rates for the present test would provide good diagnoses.

[Figure 2.13: tree diagram. Products shown beside the positive-test branches: P(A1 ∩ B) = .00099 and P(A2 ∩ B) = .01998]

Figure 2.13 Tree diagram for the rare disease problem

This example shows why it makes sense to be tested for a rare disease only if you are in a high-risk 
group. For example, most of us are at low risk for HIV infection, so testing would not be indicated, 
but those who are in a high-risk group should be tested for HIV. For some diseases the degree of risk 
is strongly influenced by age. Young women are at low risk for breast cancer and should not be tested, 
but older women do have increased risk and need to be tested. There is some argument about where to 
draw the line. If we can find the incidence rate for our group and the sensitivity and specificity for the 
test, then we can do our own calculation to see if a positive test result would be informative. ■
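
The "do our own calculation" suggested above amounts to applying Bayes' theorem with k = 2. The Python sketch below computes P(disease | positive) from an incidence rate, a sensitivity, and a specificity. The function name `positive_predictive_value` is our own label, and the 1% incidence is an extra illustrative value that does not appear in the example; the .001 and 25% incidences are the ones discussed in Example 2.32.

```python
def positive_predictive_value(incidence, sensitivity, specificity):
    """Return P(disease | positive test) via Bayes' theorem with k = 2."""
    true_positive = incidence * sensitivity                  # P(A1 ∩ B)
    false_positive = (1 - incidence) * (1 - specificity)     # P(A2 ∩ B)
    return true_positive / (true_positive + false_positive)

# Test from Example 2.32: sensitivity 99%, specificity 98%
results = {
    incidence: positive_predictive_value(incidence, 0.99, 0.98)
    for incidence in (0.001, 0.01, 0.25)   # 0.01 is an illustrative addition
}

for incidence, ppv in results.items():
    print(f"incidence {incidence}: P(disease | positive) = {ppv:.3f}")
```

Holding the test fixed and varying only the incidence shows how strongly the posterior probability depends on the prior, which is the point of the high-risk-group discussion.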


An important contemporary application of Bayes’ theorem is in the identification of spam e-mail 
messages. A nice expository article on this appears in Statistics: A Guide to the Unknown (see the 
bibliography). 


Exercises: Section 2.4 (49-73) 


49. The population of a particular country consists of three ethnic groups. Each individual belongs
to one of the four major blood groups. The accompanying joint probability table gives the
proportions of individuals in the various ethnic group–blood group combinations.

                    Blood group
Ethnic group    O       A       B       AB
1               .082    .106    .008    .004
2               .135    .141    .018    .006
3               .215    .200    .065    .020

Suppose that an individual is randomly selected from the population, and define events by
A = {type A selected}, B = {type B selected}, and C = {ethnic group 3 selected}.

a. Calculate P(A), P(C), and P(A ∩ C).
b. Calculate both P(A|C) and P(C|A) and explain in context what each of these probabilities
represents.
c. If the selected individual does not have type B blood, what is the probability that he or she is
from ethnic group 1?

50. Suppose an individual is randomly selected from the population of all adult males living in the
United States. Let A be the event that the selected individual is over 6 ft in height, and let B be
the event that the selected individual is a professional basketball player. Which do you think is
larger, P(A|B) or P(B|A)? Why?

51. Return to the credit card scenario of Exercise 14, where A = {Visa}, B = {MasterCard},
P(A) = .5, P(B) = .4, and P(A ∩ B) = .25. Calculate and interpret each of the following
probabilities (a Venn diagram might help).

a. P(B|A)
b. P(B′|A)
c. P(A|B)
d. P(A′|B)
e. Given that the selected individual has at least one card, what is the probability that he or she
has a Visa card?


52. Reconsider the system defect situation described in Exercise 28.

a. Given that the system has a type 1 defect, what is the probability that it has a type 2 defect?
b. Given that the system has a type 1 defect, what is the probability that it has all three types of
defects?
c. Given that the system has at least one type of defect, what is the probability that it has exactly
one type of defect?
d. Given that the system has both of the first two types of defects, what is the probability that it
does not have the third type of defect?

53. If two bulbs are randomly selected from the box of lightbulbs described in Exercise 42 and at
least one of them is found to be rated 75 W, what is the probability that both of them are 75-W
bulbs? Given that at least one of the two selected is not rated 75 W, what is the probability that
both selected bulbs have the same rating?

54. A department store sells sport shirts in three sizes (small, medium, and large), three patterns
(plaid, print, and stripe), and two sleeve lengths (long and short). The accompanying tables give
the proportions of shirts sold in the various category combinations.

Short-sleeved
            Pattern
Size    Pl      Pr      St
S       .04     .02     .05
M       .08     .07     .12
L       .03     .07     .08

Long-sleeved
            Pattern
Size    Pl      Pr      St
S       .03     .02     .03
M       .10     .05     .07
L       .04     .02     .08

a. What is the probability that the next shirt sold is a medium, long-sleeved, print shirt?
b. What is the probability that the next shirt sold is a medium print shirt?
c. What is the probability that the next shirt sold is a short-sleeved shirt? A long-sleeved shirt?
d. What is the probability that the size of the next shirt sold is medium? That the pattern of the
next shirt sold is a print?
e. Given that the shirt just sold was a short-sleeved plaid, what is the probability that its size was
medium?
f. Given that the shirt just sold was a medium plaid, what is the probability that it was
short-sleeved? Long-sleeved?

55. One box contains six red balls and four green balls, and a second box contains seven red balls
and three green balls. A ball is randomly chosen from the first box and placed in the second box.
Then a ball is randomly selected from the second box and placed in the first box.

a. What is the probability that a red ball is selected from the first box and a red ball is selected
from the second box?
b. At the conclusion of the selection process, what is the probability that the numbers of red and
green balls in the first box are identical to the numbers at the beginning?

56. A system consists of two identical pumps, #1 and #2. If one pump fails, the system will still
operate. However, because of the added strain, the remaining pump is now more likely to fail than
was originally the case. That is, r = P(#2 fails | #1 fails) > P(#2 fails) = q. If at least one pump
fails by the end of the pump design life in 7% of all systems and both pumps fail during that period
in only 1%, what is the probability that pump #1 will fail during the pump design life?


57. A certain shop repairs both audio and video components. Let A denote the event that the next
component brought in for repair is an audio component, and let B be the event that the next
component is an MP3 player (so the event B is contained in A). Suppose that P(A) = .6 and
P(B) = .05. What is P(B|A)?

58. In Exercise 15, Ai = {awarded project i}, for i = 1, 2, 3. Use the probabilities given there to
compute the following probabilities, and explain in words the meaning of each one.

a. P(A2|A1)
b. P(A2 ∩ A3|A1)
c. P(A2 ∪ A3|A1)
d. P(A1 ∩ A2 ∩ A3|A1 ∪ A2 ∪ A3)

59. Refer back to the culture website scenario in Example 2.27.

a. Given that someone regularly reads at least one of the three sections listed (Art, Books,
Cinema), what is the probability she reads all three?
b. Given that someone regularly reads all three sections, what is the probability she reads at least
one? [Think carefully!]

60. Three plants manufacture hard drives and ship them to a warehouse for distribution. Plant I
produces 54% of the warehouse’s inventory with a 4% defect rate. Plant II produces 35% of the
warehouse’s inventory with an 8% defect rate. Plant III produces the remainder of the warehouse’s
inventory with a 12% defect rate.

a. Draw a tree diagram to represent this information.
b. A warehouse inspector selects one hard drive at random. What is the probability that it is a
defective hard drive and from Plant II?
c. What is the probability that a randomly selected hard drive is defective?
d. Suppose a hard drive is defective. What is the probability that it came from Plant II?

61. For any events A and B with P(B) > 0, show that P(A|B) + P(A′|B) = 1.

62. If P(B|A) > P(B) show that P(B′|A) < P(B′). [Hint: Add P(B′|A) to both sides of the given
inequality, and then use the result of the previous exercise.]

63. Show that for any three events A, B, and C with P(C) > 0,
P(A ∪ B|C) = P(A|C) + P(B|C) − P(A ∩ B|C).

64. At a gas station, 40% of the customers use regular gas (A1), 35% use midgrade gas (A2), and
25% use premium gas (A3). Of those customers using regular gas, only 30% fill their tanks (event
B). Of those customers using midgrade gas, 60% fill their tanks, whereas of those using premium,
50% fill their tanks.

a. What is the probability that the next customer will request midgrade gas and fill the tank
(A2 ∩ B)?
b. What is the probability that the next customer fills the tank?
c. If the next customer fills the tank, what is the probability that regular gas is requested?
Midgrade gas? Premium gas?

65. Seventy percent of the light aircraft that disappear while in flight in a certain country are
subsequently discovered. Of the aircraft that are discovered, 60% have an emergency locator,
whereas 90% of the aircraft not discovered do not have such a locator. Suppose a light aircraft has
disappeared.

a. If it has an emergency locator, what is the probability that it will not be discovered?
b. If it does not have an emergency locator, what is the probability that it will be discovered?
66. Components of a certain type are shipped to a supplier in batches of ten. Suppose that 50% of
all such batches contain no defective components, 30% contain one defective component, and 20%
contain two defective components. Two components from a batch are randomly selected and tested.
What are the probabilities associated with 0, 1, and 2 defective components being in the batch
under each of the following conditions?

a. Neither tested component is defective.
b. One of the two tested components is defective.
[Hint: Draw a tree diagram with three first-generation branches for the three different types of
batches.]

67. Verify the multiplication rule for conditional probabilities:
P(A ∩ B|C) = P(A|B ∩ C) · P(B|C).

68. For customers purchasing a full set of tires at a particular tire store, consider the events

A = {tires purchased were made in the United States}
B = {purchaser has tires balanced immediately}
C = {purchaser requests front-end alignment}

along with A′, B′, and C′. Assume the following unconditional and conditional probabilities:

P(A) = .75           P(B|A) = .9           P(B|A′) = .8
P(C|A ∩ B) = .8      P(C|A ∩ B′) = .6
P(C|A′ ∩ B) = .7     P(C|A′ ∩ B′) = .3

a. Construct a tree diagram consisting of first-, second-, and third-generation branches, and place
an event label and appropriate probability next to each branch.
b. Compute P(A ∩ B ∩ C).
c. Compute P(B ∩ C).
d. Compute P(C).
e. Compute P(A|B ∩ C), the probability of a purchase of U.S. tires given that both balancing and
an alignment were requested.
69. A professional organization (for statisticians, of course) sells term life insurance and major
medical insurance. Of those who have just life insurance, 70% will renew next year, and 80% of
those with only a major medical policy will renew next year. However, 90% of policyholders who
have both types of policy will renew at least one of them next year. Of the policyholders 75% have
term life insurance, 45% have major medical, and 20% have both.

a. Calculate the percentage of policyholders that will renew at least one policy next year.
b. If a randomly selected policyholder does in fact renew next year, what is the probability that he
or she has both types of policies?

70. At a large university, in the never-ending quest for a satisfactory textbook, the Statistics
Department has tried a different text during each of the last three quarters. During the fall
quarter, 500 students used the text by Professor Mean; during the winter quarter, 300 students used
the text by Professor Median; and during the spring quarter, 200 students used the text by
Professor Mode. A survey at the end of each quarter showed that 200 students were satisfied with
Mean’s book, 150 were satisfied with Median’s book, and 160 were satisfied with Mode’s book. If a
student who took statistics during one of these quarters is selected at random and admits to having
been satisfied with the text, is the student most likely to have used the book by Mean, Median, or
Mode? Who is the least likely author? [Hint: Draw a tree diagram or use Bayes’ theorem.]

71. A friend who lives in Los Angeles makes frequent consulting trips to Washington, D.C.; 50% of
the time she travels on airline #1, 30% of the time on airline #2, and the remaining 20% of the time
on airline #3. For airline #1, flights are late into D.C. 30% of the time and late into L.A. 10% of
the time. For airline #2, these percentages are 25 and 20%, whereas for airline #3 the percentages
are 40 and 25%. If we learn that on a particular trip she arrived late at exactly one of the two
destinations, what are the posterior probabilities of having flown on airlines #1, #2, and #3?
Assume that the chance of a late arrival in L.A. is unaffected by what happens on the flight to D.C.
[Hint: From the tip of each first-generation branch on a tree diagram, draw three second-generation
branches labeled, respectively, 0 late, 1 late, and 2 late.]
72. Suppose a single gene controls the color of hamsters: black (B) is dominant and brown (b) is
recessive. Hence, a hamster will be black unless its genotype is bb. Two hamsters, each with
genotype Bb, mate and produce a single offspring. The laws of genetic recombination state that each
parent is equally likely to donate either of its two alleles (B or b), so the offspring is equally
likely to be any of BB, Bb, bB, or bb (the middle two are genetically equivalent).

a. What is the probability their offspring has black fur?
b. Given that their offspring has black fur, what is the probability its genotype is Bb?

73. Refer back to the scenario of the previous exercise. In the figure below, the genotypes of both
members of Generation I are known, as is the genotype of the male member of Generation II. We
know that hamster II2 must be black-colored thanks to her father, but suppose that we don’t know
her genotype exactly (as indicated by B- in the figure).

[Figure: pedigree chart. Generation I: two hamsters, one labeled BB; Generation II: II1 (genotype Bb) and II2 (genotype B-); Generation III: one offspring]

a. What are the possible genotypes of hamster II2, and what are the corresponding probabilities?
b. If we observe that hamster III1 has a black coat (and hence at least one B gene), what is the
probability her genotype is Bb?
c. If we later discover (through DNA testing on poor little hamster III1) that her genotype is BB,
what is the posterior probability that her mom is also BB?


The definition of conditional probability enables us to revise the probability P(A) originally assigned 
to A when we are subsequently informed that another event B has occurred; the new probability of 
A is P(A|B). In our examples, it was frequently the case that P(A|B) differed from the unconditional 
probability P(A), indicating that the information “B has occurred” resulted in a change in the chance 
of A occurring. There are other situations, though, in which the chance that A will occur or has 
occurred is not affected by knowledge that B has occurred, so that P(A|B) = P(A). It is then natural to 
think of A and B as independent events, meaning that the occurrence or nonoccurrence of one event
has no bearing on the chance that the other will occur.


DEFINITION Two events A and B are independent if P(A|B) = P(A) and are dependent otherwise.


The definition of independence might seem “unsymmetrical” because we do not demand that P(B|A) = 
P(B) also. However, using the definition of conditional probability and the Multiplication Rule, 


\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B) P(B)}{P(A)} \qquad (2.7) \]

The right-hand side of Equation (2.7) is P(B) if and only if P(A|B) = P(A) (independence), so the
equality in the definition implies the other equality (and vice versa). It is also straightforward to show 
that if A and B are independent, then so are the following pairs of events: (1) A’ and B, (2) A and B’, 
and (3) A’ and B' (see Exercise 77). 


Example 2.33 Consider an ordinary deck of 52 cards comprising the four “suits” spades, hearts,
diamonds, and clubs, with each suit consisting of the 13 ranks ace, king, queen, jack, ten, ..., and two.
Suppose someone randomly selects a card from the deck and reveals to you that it is a face card (that
is, a king, queen, or jack). What now is the probability that the card is a spade? If we let A = {spade}
and B = {face card}, then P(A) = 13/52, P(B) = 12/52 (there are three face cards in each of the four
suits), and P(A ∩ B) = P(spade and face card) = 3/52. Thus

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{3/52}{12/52} = \frac{3}{12} = \frac{1}{4} = \frac{13}{52} = P(A) \]

Therefore, the likelihood of getting a spade is not affected by knowledge that a face card had been selected.
Intuitively this is because the fraction of spades among face cards (3 out of 12) is the same as the fraction of
spades in the entire deck (13 out of 52). It is also easily verified that P(B|A) = P(B), so knowledge that a
spade has been selected does not affect the likelihood of the card being a jack, queen, or king. ■
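
Since all 52 cards are equally likely, the independence claim in Example 2.33 can be verified by direct counting. The Python sketch below builds the deck and compares P(A|B) with P(A), and P(B|A) with P(B), using exact fractions; the helper names `prob`, `is_spade`, and `is_face` are our own.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))     # 52 equally likely (rank, suit) outcomes

def prob(event):
    """P(event) for a predicate on cards, found by counting outcomes."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_spade(card):
    return card[1] == "spades"

def is_face(card):
    return card[0] in {"K", "Q", "J"}

p_a = prob(is_spade)                                         # P(A) = 13/52
p_b = prob(is_face)                                          # P(B) = 12/52
p_ab = prob(lambda card: is_spade(card) and is_face(card))   # P(A ∩ B) = 3/52

print(p_ab / p_b == p_a)   # is P(A|B) equal to P(A)?
print(p_ab / p_a == p_b)   # is P(B|A) equal to P(B)?
```

Exact fractions avoid any floating-point doubt about whether the two sides are truly equal.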


Example 2.34 Let A and B be any two mutually exclusive events with P(A) > 0 and P(B) > 0. For example,
for a randomly chosen automobile, let A = {car is blue} and B = {car is red}. Since the events are
mutually exclusive, if B occurs, then A cannot possibly have occurred, so P(A|B) = 0 ≠ P(A). The
message here is that if two events with positive probability are mutually exclusive, they cannot be
independent. When A and B are mutually exclusive, the information that A occurred says something
about B (it cannot have occurred), so independence is precluded. ■


P(A ∩ B) When Events Are Independent

Frequently the nature of an experiment suggests that two events A and B should be assumed
independent. This is the case, for example, if a manufacturer receives a circuit board from each of two
different suppliers, each board is tested on arrival, and A = {first is defective} and B = {second is
defective}. If P(A) = .1, it should also be the case that P(A|B) = .1; knowing the condition of the
second board shouldn’t provide information about the condition of the first. Our next result shows
how to compute P(A ∩ B) when the events are independent.


PROPOSITION A and B are independent if and only if

\[ P(A \cap B) = P(A) \cdot P(B) \qquad (2.8) \]
Proof By the Multiplication Rule, P(A ∩ B) = P(A|B) · P(B), and this equals P(A) · P(B) if and only
if P(A|B) = P(A). ■


Because of the equivalence of independence with Equation (2.8), the latter can be used as a
definition of independence.¹


Example 2.35 It is known that 3% of a certain machine tool manufacturing company’s band saws 
break down within the first six months of ownership, compared to only 1% of its industrial lathes. If a 
machine shop purchases both a band saw and a lathe made by this company, what is the probability 
that both machines will break down within six months? 

Let A denote the event that the band saw breaks down in the first six months, and define 
B analogously for the industrial lathe. Then P(A) = .03 and P(B) = .01. Assuming that the two 
machines function independently of each other, the desired probability is 


\[ P(A \cap B) = P(A) \cdot P(B) = (.03)(.01) = .0003 \]

The probability that neither machine breaks down in that time period is

\[ P(A' \cap B') = P(A') \cdot P(B') = (.97)(.99) = .9603 \]

Note that, although the independence assumption is reasonable here, it can be questioned. In particular,
if heavy use causes a breakdown in one machine, it could also cause trouble for the other one. ■


Example 2.36 Each day, Monday through Friday, a batch of components sent by a first supplier 
arrives at a certain inspection facility. Two days a week, a batch also arrives from a second supplier. 
Eighty percent of all supplier 1’s batches pass inspection, and 90% of supplier 2’s do likewise. What 
is the probability that, on a randomly selected day, two batches pass inspection? We will answer this 
assuming that on days when two batches are tested, whether the first batch passes is independent of 
whether the second batch does so. Figure 2.14 displays the relevant information. 


Figure 2.14 Tree diagram for Example 2.36 


P(two pass) = P(two received ∩ both pass) 
            = P(both pass | two received) · P(two received) 
            = [(.8)(.9)](.4) = .288 ■ 


¹ However, the multiplication property is satisfied if P(B) = 0, yet P(A|B) is not defined in this case. To make the 
multiplication property completely equivalent to the definition of independence, we should append to that definition that 
A and B are also independent if either P(A) = 0 or P(B) = 0. 


Independence of More Than Two Events 

The notion of independence of two events can be extended to collections of more than two events. 
Although it is possible to extend the definition for two independent events by working in terms of 
conditional and unconditional probabilities, it is more direct and less cumbersome to proceed along 
the lines of the last proposition. 


DEFINITION  Events A₁, ..., Aₙ are mutually independent if for every k (k = 2, 3, ..., n) and 
every subset of distinct indices i₁, i₂, ..., iₖ, 

P(A_{i₁} ∩ A_{i₂} ∩ ··· ∩ A_{iₖ}) = P(A_{i₁}) · P(A_{i₂}) · ··· · P(A_{iₖ}) 


To paraphrase the definition, the events are mutually independent if the probability of the intersection 
of any subset of the n events is equal to the product of the individual probabilities. In using this 
multiplication property for more than two independent events, it is legitimate to replace one or more 
of the Aᵢ’s by their complements (e.g., if A₁, A₂, and A₃ are independent events, then so are 
A₁′, A₂′, and A₃′). As was the case with two events, we frequently specify at the outset of a problem 
the independence of certain events. The definition can then be used to calculate the probability of an 
intersection. 
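To see why the definition demands the product rule for every subset and not just for pairs, here is a small R sketch. The sample space (two tosses of a fair coin), the three events, and the helper names P and mutually_independent are our own illustrative choices, not part of the text.

```r
# Sample space: two tosses of a fair coin, each outcome with probability 1/4
omega <- c("HH", "HT", "TH", "TT")
prob  <- rep(1/4, 4)

# Events stored as logical vectors over omega (hypothetical illustration)
A <- substr(omega, 1, 1) == "H"     # first toss is heads
B <- substr(omega, 2, 2) == "H"     # second toss is heads
C <- omega %in% c("HT", "TH")       # exactly one head
events <- list(A = A, B = B, C = C)

P <- function(ev) sum(prob[ev])     # probability of an event

# TRUE if the product rule holds for the events with indices idx
product_rule_holds <- function(events, idx, tol = 1e-12) {
  joint   <- Reduce(`&`, events[idx])          # intersection of the chosen events
  product <- prod(sapply(events[idx], P))      # product of their probabilities
  abs(P(joint) - product) < tol
}

# Check every subset of size k = 2, ..., n, as the definition requires
mutually_independent <- function(events) {
  n <- length(events)
  for (k in 2:n) {
    for (idx in combn(n, k, simplify = FALSE)) {
      if (!product_rule_holds(events, idx)) return(FALSE)
    }
  }
  TRUE
}

pairwise_ok <- all(sapply(combn(3, 2, simplify = FALSE),
                          function(idx) product_rule_holds(events, idx)))
print(pairwise_ok)                    # all three pairs satisfy the product rule
print(mutually_independent(events))   # but the triple intersection does not
```

Here every pair has intersection probability 1/4 = (1/2)(1/2), yet A ∩ B ∩ C is empty, so its probability 0 differs from (1/2)(1/2)(1/2). Pairwise independence alone therefore does not give mutual independence.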


Example 2.37 The article “Reliability Evaluation of Solar Photovoltaic Arrays” (Solar Energy 
2002: 129-141) presents various configurations of solar photovoltaic arrays consisting of crystalline 
silicon solar cells. Consider first the system illustrated in Figure 2.15a. There are two subsystems 
connected in parallel, each one containing three cells. In order for the system to function, at least one 
of the two parallel subsystems must work. Within each subsystem, the three cells are connected in 
series, so a subsystem will work only if all cells in the subsystem work. Consider a particular lifetime 
value t₀, and suppose we want to determine the probability that the system lifetime exceeds t₀. Let Aᵢ 
denote the event that the lifetime of cell i exceeds t₀ (i = 1, 2, ..., 6). We assume that the Aᵢ’s are 
independent events (whether any particular cell lasts more than t₀ hours has no bearing on whether 
any other cell does) and that P(Aᵢ) = .9 for every i since the cells are all made the same way. 


Figure 2.15 System configurations for Example 2.37: (a) series-parallel; (b) total-cross-tied 


P(system lifetime exceeds t₀) = P[(A₁ ∩ A₂ ∩ A₃) ∪ (A₄ ∩ A₅ ∩ A₆)] 
   = P(A₁ ∩ A₂ ∩ A₃) + P(A₄ ∩ A₅ ∩ A₆) − P[(A₁ ∩ A₂ ∩ A₃) ∩ (A₄ ∩ A₅ ∩ A₆)] 
   = (.9)(.9)(.9) + (.9)(.9)(.9) − (.9)(.9)(.9)(.9)(.9)(.9) 
   = .927 

Alternatively, 

P(system lifetime exceeds t₀) = 1 − P(both subsystem lives are ≤ t₀) 
   = 1 − [P(subsystem life is ≤ t₀)]² 
   = 1 − [1 − P(subsystem life is > t₀)]² 
   = 1 − [1 − .9³]² = .927 
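Both routes to .927 can be reproduced in R. The helper names series and parallel below are our own (hypothetical) shorthand for the two ways of combining independent components; this is a sketch for the series-parallel array only.

```r
# Reliability of independent components
series   <- function(p) prod(p)           # works only if every component works
parallel <- function(p) 1 - prod(1 - p)   # works if at least one component works

p_cell <- 0.9                             # P(A_i) from Example 2.37
sub1 <- series(rep(p_cell, 3))            # cells 1, 2, 3 in series
sub2 <- series(rep(p_cell, 3))            # cells 4, 5, 6 in series

# Route 1: addition rule, with the intersection factored by independence
via_union <- sub1 + sub2 - sub1 * sub2

# Route 2: complement rule (both subsystems must fail for the system to fail)
via_complement <- parallel(c(sub1, sub2))

print(round(c(via_union, via_complement), 3))
```

The two values agree because 1 − (1 − a)(1 − b) = a + b − ab for any probabilities a and b.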


Next consider the total-cross-tied system shown in Figure 2.15b, obtained from the series-parallel 
array by connecting ties across each column of junctions. Now the system fails as soon as an entire 
column fails, and system lifetime exceeds t₀ only if the life of every column does so. For this 
configuration, 

P(system lifetime exceeds t₀) = [P(column lifetime exceeds t₀)]³ 
   = [1 − P(column lifetime is ≤ t₀)]³ 
   = [1 − P(both cells in a column have lifetime ≤ t₀)]³ 
   = [1 − (1 − .9)²]³ = .970 ■ 


Exercises: Section 2.5 (74–92) 


74. Reconsider the credit card scenario of 
Exercise 51 (Section 2.4), and show that 
A and B are dependent first by using the 
definition of independence and then by 
verifying that the multiplication property 
does not hold. 


75. An oil exploration company currently has 
two active projects, one in Asia and the 
other in Europe. Let A be the event that the 
Asian project is successful and B be the 
event that the European project is success- 
ful. Suppose that A and B are independent 
events with P(A) = .4 and P(B) = .7. 


a. If the Asian project is not successful, 
what is the probability that the Euro- 
pean project is also not successful? 
Explain your reasoning. 



b. What is the probability that at least one 
of the two projects will be successful? 

c. Given that at least one of the two pro- 
jects is successful, what is the proba- 
bility that only the Asian project is 
successful? 


76. In Exercise 15, is any Aᵢ independent of any 
other Aⱼ? Answer using the multiplication 
property for independent events. 


77. If A and B are independent events, show 
that A′ and B are also independent. [Hint: 
First establish a relationship among 
P(A′ ∩ B), P(B), and P(A ∩ B).] 


78. Suppose that the proportions of blood 
phenotypes in a particular population are as 
follows: 

Assuming that the phenotypes of two ran- 
domly selected individuals are independent 
of each other, what is the probability that 
both phenotypes are O? What is the prob- 
ability that the phenotypes of two randomly 
selected individuals match? 


79. The probability that a grader will make a 
marking error on any particular question of 
a multiple-choice exam is .1. If there are ten 
questions and questions are marked inde- 
pendently, what is the probability that no 
errors are made? That at least one error is 
made? If there are n questions and the 
probability of a marking error is p rather 
than .1, give expressions for these two 
probabilities. 


80. An aircraft seam requires 25 rivets. The 
seam will have to be reworked if any of 
these rivets is defective. Suppose rivets are 
defective independently of one another, 
each with the same probability. 


a. If 20% of all seams need reworking, 
what is the probability that a rivet is 
defective? 

b. How small should the probability of a 
defective rivet be to ensure that only 
10% of all seams need reworking? 


81. A boiler has five identical relief valves. The 
probability that any particular valve will 
open on demand is .95. Assuming inde- 
pendent operation of the valves, calculate 
P(at least one valve opens) and P(at least 
one valve fails to open). 


82. Two pumps connected in parallel fail 
independently of each other on any given 
day. The probability that only the older 
pump will fail is .10, and the probability 
that only the newer pump will fail is .05. 
What is the probability that the pumping 
system will fail on any given day (which 
happens if both pumps fail)? 


83. Consider the system of components con- 
nected as in the accompanying picture. 
Components 1 and 2 are connected in par- 
allel, so that subsystem works if and only if 
either 1 or 2 works; since 3 and 4 are con- 
nected in series, that subsystem works if 
and only if both 3 and 4 work. If compo- 
nents work independently of one another 
and P(component works) = .9, calculate 
P(system works). 


84. Refer back to the series-parallel system 
configuration introduced in Example 2.37, 
and suppose that there are only two cells 
rather than three in each parallel subsystem 
[in Figure 2.15a, eliminate cells 3 and 6, 
and re-number cells 4 and 5 as 3 and 4]. 
Using P(Aᵢ) = .9, the probability that sys- 
tem lifetime exceeds t₀ is easily seen to be 
.9639. To what value would .9 have to be 
changed in order to increase the system 
lifetime reliability from .9639 to .99? [Hint: 
Let P(Aᵢ) = p, express system reliability in 
terms of p, and then let x = p².] 


85. Consider independently rolling two fair 
dice, one red and the other green. Let A be 
the event that the red die shows 3 dots, B be 
the event that the green die shows 4 dots, 
and C be the event that the total number of 
dots showing on the two dice is 7. 


a. Are these events pairwise independent 
(i.e., are A and B independent events, 
are A and C independent, and are B and 
C independent)? 

b. Are the three events mutually 
independent? 


86. Components arriving at a distributor are 
checked for defects by two different 
inspectors (each component is checked by 
both inspectors). The first inspector detects 
90% of all defectives that are present, and 
the second inspector does likewise. At least 
one inspector fails to detect a defect on 
20% of all defective components. What is 
the probability that the following occur? 


a. A defective component will be detected 
only by the first inspector? By exactly 
one of the two inspectors? 

b. All three defective components in a 
batch escape detection by both inspec- 
tors (assuming inspections of different 
components are independent of one 
another)? 


87. A quality control inspector is inspecting 
newly produced items for faults. The 
inspector searches an item for faults in a 
series of independent “fixations,” each of a 
fixed duration. Given that a flaw is actually 
present, let p denote the probability that the 
flaw is detected during any one fixation 
(this model is discussed in “Human Per- 
formance in Sampling Inspection,” Hum. 
Factors 1979: 99-105). 


a. Assuming that an item has a flaw, what 
is the probability that it is detected by 
the end of the second fixation (once a 
flaw has been detected, the sequence of 
fixations terminates)? 

b. Give an expression for the probability 
that a flaw will be detected by the end of 
the nth fixation. 

c. If when a flaw has not been detected in 
three fixations, the item is passed, what 
is the probability that a flawed item will 
pass inspection? 

d. Suppose 10% of all items contain a flaw 
[P(randomly chosen item is flawed) = .1]. 
With the assumption of part (c), what is 
the probability that a randomly chosen 
item will pass inspection (it will auto- 
matically pass if it is not flawed, but could 
also pass if it is flawed)? 

e. Given that an item has passed inspection 
(no flaws in three fixations), what is the 
probability that it is actually flawed? 
Calculate for p = .5. 


88. a. A lumber company has just taken 
delivery on a lot of 10,000 2 × 4 boards. 
Suppose that 20% of these boards (2000) 
are actually too green to be used in first- 
quality construction. Two boards are 
selected at random, one after the other. 
Let A = {the first board is green} and 
B = {the second board is green}. Com- 
pute P(A), P(B), and P(A ∩ B) (a tree 
diagram might help). Are A and 
B independent? 

b. With A and B independent and P(A) = 
P(B) = .2, what is P(A ∩ B)? How 
much difference is there between this 
answer and P(A ∩ B) in part (a)? For 
purposes of calculating P(A ∩ B), can 
we assume that A and B of part (a) are 
independent to obtain essentially the 
correct probability? 

c. Suppose the lot consists of ten boards, 
of which two are green. Does the 
assumption of independence now yield 
approximately the correct answer for 
P(A ∩ B)? What is the critical difference 
between the situation here and that of 
part (a)? When do you think that an 
independence assumption would be 
valid in obtaining an approximately 
correct answer to P(A ∩ B)? 


89. Refer to the assumptions stated in Exercise 
83, and answer the question posed there for 
the system in the accompanying picture. 
How would the probability change if this 
were a subsystem connected in parallel to 
the subsystem pictured in Figure 2.15a? 


90. Professor Stander Deviation can take one of 
two routes on his way home from work. On 
the first route, there are four railroad 
crossings. The probability that he will be 
stopped by a train at any particular one of 
the crossings is .1, and trains operate 
independently at the four crossings. The 
other route is longer but there are only two 
crossings, independent of each other, with 
the same stoppage probability for each as 
on the first route. On a particular day, 
Professor Deviation has a meeting sched- 
uled at home for a certain time. Whichever 
route he takes, he calculates that he will be 
late if he is stopped by trains at half or more of 
the crossings encountered. 


a. Which route should he take to minimize 
the probability of being late to the 
meeting? 

b. If he tosses a fair coin to decide on a 
route and he is late, what is the proba- 
bility that he took the four-crossing 
route? 

91. Suppose identical tags are placed on both 
the left ear and the right ear of a fox. The 
fox is then let loose for a period of time. 
Consider the two events C₁ = {left ear tag 
is lost} and C₂ = {right ear tag is lost}. Let 
p = P(C₁) = P(C₂), and assume C₁ and C₂ 
are independent events. Derive an expres- 
sion (involving p) for the probability that 
exactly one tag is lost given that at most 
one is lost (“Ear Tag Loss in Red Foxes,” 
J. Wildlife Manag. 1976: 164-167). 


92. It’s a commonly held misconception that if 
you play the lottery n times, and the prob- 
ability of winning each time is 1/N, then 
your chance of winning at least once is n/N. 
That’s true if you buy n tickets in one week, 
but not if you buy a single ticket in each of 
n independent weeks. Let’s explore further. 


a. Suppose you play a game n independent 
times, with P(win) = 1/N each time. 
Find an expression for the probability 
you win at least once. [Hint: Consider 
the complement.] 

b. How does your answer to (a) compare 
to n/N for the easy task of rolling a 4 on 
a fair die (so N = 6) in n = 3 tries? In 
n = 6 tries? In n = 10 tries? 

c. Now consider a weekly lottery where 
you must guess the 6 winning numbers 
from 1 to 49, so N = (49 choose 6). If you play 
this lottery every week for a year 
(n = 52), how does your answer to 
(a) compare to n/N? 

d. Show that when n is much smaller than 
N, the fraction n/N is not a bad approx- 
imation to (a). [Hint: Use the binomial 
theorem from high school algebra.] 


As probability models in engineering and the sciences have grown in complexity, many problems 
have arisen that are too difficult to attack “analytically,” i.e., using just mathematical tools such as 
those in the previous sections. Instead, computer simulation provides us an effective way to estimate 
probabilities of very complicated events (and, in later chapters, of other properties of random phe- 
nomena). In this section, we introduce the principles of probability simulation, demonstrate a few 
examples with R code, and discuss the precision of simulated probabilities. 


Suppose an investigator wishes to determine P(A), but either the experiment on which A is defined 
or the event A itself is so complicated as to preclude the use of probability rules and properties. The 
general method for estimating this probability via computer simulation is as follows: 

- Write a program that simulates (mimics) the underlying random experiment. 
- Run the program many times, with each run independent of all others. 
- During each run, record whether or not the event A of interest occurs. 


If the simulation is run a total of n independent times, then the estimate of P(A), denoted by P̂(A), is 

P̂(A) = (number of times A occurs)/(number of runs) = n(A)/n 


For example, if we run a simulation program 10,000 times and the event of interest A occurs in 6174 
of those runs, then our estimate of P(A) is P̂(A) = 6174/10,000 = .6174. Notice that our definition is 
consistent with the long-run relative frequency interpretation of probability discussed in Section 2.2. 
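The three-step recipe can be sketched as a small reusable R function. Everything named here is our own assumption for illustration: the helper estimate_prob, the function run_once that simulates a single run, the seed, and the toy event (two fair dice summing to 7, whose exact probability is 1/6).

```r
set.seed(1)   # assumed seed, used only so the sketch is reproducible

# Hypothetical helper: run_once() simulates one run of the experiment
# and returns TRUE if the event A occurred on that run
estimate_prob <- function(run_once, n = 10000) {
  hits <- 0
  for (i in 1:n) {
    if (run_once()) hits <- hits + 1   # record whether A occurred
  }
  hits / n                             # relative frequency n(A)/n
}

# Illustrative event A: two fair dice sum to 7
sum_is_seven <- function() sum(sample(1:6, 2, replace = TRUE)) == 7

p_hat <- estimate_prob(sum_is_seven)
print(p_hat)   # should land near 1/6
```

Choosing an event with a known answer is a convenient way to confirm that the simulation code behaves sensibly before applying it to harder problems.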


The Backbone of Simulation: Random Number Generators 

All modern software packages are equipped with a function called a random number generator 
(RNG). A typical call to this function (such as ran or rand) will return a single, supposedly “random” 
number, though such functions typically permit the user to request a vector or even a matrix of 
“random” numbers. It is more proper to call these results pseudorandom numbers, since there is 
actually a deterministic (i.e., nonrandom) algorithm by which the software generates these values. We 
will not discuss the details of such algorithms here; see the book by Law listed in the references. What 
will matter to us are the following two characteristics: 


1. Each number created by an RNG is as likely to be any particular number in the interval [0, 1) as it 
is to be any other number in this interval (up to computer precision, anyway). 

2. Successive values created by RNGs are independent, in the sense that we cannot predict the next 
value to be generated from the current value (unless we somehow know the exact parameters of 
the underlying algorithm). 


A typical simulation program manipulates numbers on the interval [0, 1) in a way that mimics the 
experiment of interest; several examples are provided below. Arguably the most important building 
block for such programs is the ability to simulate a basic event that occurs with a known probability, 
p. Since RNGs produce values equally likely to be anywhere in the interval [0, 1), it follows that in 
the long run a proportion p of them will lie in the interval [0, p). So, suppose we need to simulate an 
event B with P(B) = p. In each run of our simulation program, we can call for a single “random” 
number, which we’ll call u, and apply the following rules: 


- If 0 ≤ u < p, then event B has occurred on this run of the program. 
- If p ≤ u < 1, then event B has not occurred on this run of the program. 
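The rule above can be tried directly in R. The value p = .3, the seed, and the variable names are illustrative assumptions on our part; runif generates the pseudorandom numbers on [0, 1).

```r
set.seed(2)            # assumed seed, for reproducibility only
p <- 0.3               # illustrative value of P(B); not taken from the text
u <- runif(10000)      # 10,000 pseudorandom numbers, one per run

B_occurred <- u < p    # TRUE on the runs where 0 <= u < p
prop_B <- mean(B_occurred)
print(prop_B)          # long-run proportion of runs on which B occurs
```

Because the logical values TRUE and FALSE are treated as 1 and 0, mean() returns the proportion of runs on which B occurred, which should be close to p.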


Example 2.38 Let’s begin with an example in which the exact probability can be obtained ana- 
lytically, so that we may verify that our simulation method works. Suppose we have two independent 
devices which function with probabilities .6 and .7, respectively. What is the probability both devices 
function? That at least one device functions? 


² In the language of Chapter 4, the numbers produced by an RNG follow essentially a uniform distribution on the 
interval [0, 1). 

Let B₁ and B₂ denote the events that the first and second devices function, respectively; we know 
that P(B₁) = .6, P(B₂) = .7, and B₁ and B₂ are independent. Our first goal is to estimate the probability 
of A = B₁ ∩ B₂, the event that both devices function. The following “pseudo-code” will allow us to 
obtain P̂(A). 


0. Set a counter for the number of times A occurs to zero. 

Repeat n times: 

1. Generate two random numbers, u₁ and u₂. (These will help us determine whether B₁ and B₂ occur, 
respectively.) 
2. If u₁ < .6 AND u₂ < .7, then A has occurred. Add 1 to the count of occurrences of A. 

Once the n runs are complete, then P̂(A) = (count of the occurrences of A)/n. 

Figure 2.16 shows actual implementation code in R. We ran the program with n = 10,000 (as in 
the code) twice; the event A occurred 4218 times in the first run and 4157 the second time, providing 
estimated probabilities of P̂(A) = .4218 and .4157, respectively. Compare these to the exact proba- 
bility of A: by independence, P(A) = P(B₁)P(B₂) = (.6)(.7) = .42. Our simulation estimates were both 
“in the ballpark” of the right answer. We’ll discuss the precision of these estimates shortly. 


A=0                              # count of runs in which A occurs 
for (i in 1:10000) { 
  u1=runif(1); u2=runif(1)       # one random number per device 
  if (u1<.6 && u2<.7) {          # both devices function 
    A=A+1 
  } 
} 


Figure 2.16 R code for Example 2.38 


By replacing the “and” operator && in the above code with the “or” operator ||, we can estimate the 
probability that at least one device functions, P(B₁ ∪ B₂). In one simulation (again with n = 10,000), the 
event B₁ ∪ B₂ occurred 8802 times, giving the estimate P̂(B₁ ∪ B₂) = .8802. This is quite close to the 
exact probability: 


P(B₁ ∪ B₂) = 1 − P(B₁′ ∩ B₂′) = 1 − (1 − .6)(1 − .7) = .88 ■ 


Example 2.39 Consider the following game: You’ll flip a coin 25 times, winning $1 each time it 
lands heads (H) and losing $1 each time it lands tails (T). Unfortunately for you, the coin is biased in 
such a way that P(H) = .4 and P(T) = .6. What’s the probability you come out ahead; i.e., you have 
more money at the end of the game than you had at the beginning? We’ll use simulation to find out. 

Now each run of the simulation requires 25 “random” objects: the results of the 25 coin tosses. 
What’s more, we need to keep track of how much money you have won or lost at the end of the 25 
tosses. Let A = {you come out ahead}, and use the following pseudo-code: 


0. Set a counter for the number of times A occurs to zero. 

Repeat n times: 

1. Set your initial dollar amount to zero. 
2. Generate 25 random numbers u₁, ..., u₂₅. 
3. For each uᵢ < .4, heads was tossed, so add 1 to your dollar amount. For each uᵢ ≥ .4, the flip was 
tails and 1 is deducted. 
4. If the final dollar amount is positive (i.e., $1 or greater), add 1 to the count of occurrences for A. 

Once the n runs are complete, then P̂(A) = (count of the occurrences of A)/n. 
R code for Example 2.39 appears in Figure 2.17. Our code gave a final count of 1567 occurrences 
of A, out of 10,000 runs. Thus, the estimated probability that you come out ahead in this game is 

P̂(A) = 1567/10,000 = .1567. 


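A=0                              # count of runs in which you come out ahead 
for (i in 1:10000) { 
  dollar=0                       # start each game at $0 
  for (j in 1:25) { 
    u=runif(1) 
    if (u<.4) {                  # heads: win $1 
      dollar=dollar+1 
    } else {                     # tails: lose $1 
      dollar=dollar-1 
    } 
  } 
  if (dollar>0) {                # ahead at the end of the game 
    A=A+1 
  } 
} 


Figure 2.17 R code for Example 2.39 ■ 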


Readers familiar with basic programming will recognize that many “for” loops like those in the 
preceding examples can be sped up by vectorization, i.e., by using a function call that generates all the 
required random numbers simultaneously, rather than one at a time. Similarly, the if/else statements 
used in the preceding programs to determine whether a random number lies in an interval can be 
rewritten in terms of true/false bits, which automatically generate a 1 if a statement is true and a 0 
otherwise. For example, the R code 

if (u<.5) { 
  A=A+1 
} 

can be replaced by the single line of code 

A=A+(u<.5) 

If the statement in parentheses is true, R assigns a value 1 to (u<.5), and so 1 is added to the 
count A. 
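As a sketch of the vectorization idea, the game of Example 2.39 can be rewritten so that all 25 tosses of one game come from a single call to runif. The seed and variable names are our assumptions; the estimate will differ slightly from the .1567 reported in that example because different random numbers are used.

```r
set.seed(239)   # assumed seed; the text does not specify one
n <- 10000
A <- 0          # count of games in which you come out ahead

for (i in 1:n) {
  u <- runif(25)                         # all 25 tosses generated at once
  heads <- sum(u < .4)                   # each TRUE counts as 1
  dollar <- heads - (25 - heads)         # +$1 per head, -$1 per tail
  A <- A + (dollar > 0)                  # adds 1 only when the statement is true
}

p_hat <- A / n
print(p_hat)
```

The inner loop over the 25 tosses has disappeared entirely, and the if statement that updated the counter has been replaced by adding a true/false value.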

The previous two examples have both assumed independence of certain events: the functionality of 
neighboring devices, or the outcomes of successive coin flips. With the aid of some built-in functions 
within R, we can also simulate counting experiments similar to those in Section 2.3, even though 
draws without replacement from a finite population are not independent. To illustrate, let’s use 
simulation to estimate some of the combinatorial probabilities from Section 2.3. 

Example 2.40 Consider again the situation presented in Example 2.24: A university warehouse has 
received a shipment of 25 laptops, of which 10 have AMD processors and 15 have Intel chips; a 
particular technician will check 6 of these 25 laptops, selected at random. Of interest is the probability 
of the event D₃ = {exactly 3 of the 6 selected have Intel chips}. Although the initial probability of 
selecting a laptop with an Intel processor is 15/25, successive selections are not independent (the 
conditional probability that the next laptop also has an Intel chip is not 15/25). So, the method of the 
preceding examples does not apply. 

Instead, we use the sampling tool built into our software, as follows: 


0. Set a counter for the number of times D₃ occurs to zero. 

Repeat n times: 

1. Sample 6 numbers, without replacement, from the integers 1 through 25. (1–15 correspond to the 
labels for the 15 laptops with Intel processors, and 16–25 identify the 10 with AMD processors.) 
2. Count how many of these 6 numbers fall between 1 and 15, inclusive. 
3. If exactly 3 of these 6 numbers fall between 1 and 15, add 1 to the count of occurrences for D₃. 

Once the n runs are complete, then P̂(D₃) = (count of the occurrences of D₃)/n. 

R code for this example appears in Figure 2.18. Vital to the execution of this simulation is the fact 
that R (like many statistical software packages) has a built-in mechanism for randomly sampling 
without replacement from a finite set of objects (here, the integers 1-25). For more information on 
this function, type help(sample) in R. 

D=0                              # count of runs in which D3 occurs 
for (i in 1:10000) { 
  chips=sample(25,6)             # 6 of the 25 laptops, without replacement 
  intel=sum(chips<=15)           # labels 1-15 are the Intel laptops 
  if (intel==3) { 
    D=D+1 
  } 
} 


Figure 2.18 R code for Example 2.40 


In the code, the line sum(chips<=15) performs two actions. First, chips<=15 converts each 
of the 6 numbers in the vector chips into a 1 if the entry is between 1 and 15 (and into a 0 
otherwise). Second, sum() adds up the 1’s and 0’s, which is equivalent to identifying how many 1’s 
appear (i.e., how many of the 6 numbers fell between 1 and 15). 

Our code resulted in event D₃ occurring 3054 times, so P̂(D₃) = 3054/10,000 = .3054, which is 
quite close to the “exact” answer of .3083 found in Example 2.24. The other probability of interest, 
the chance of randomly selecting at least 3 laptops with Intel processors, can be estimated by 
modifying one line of code: change intel==3 to intel>=3. One simulation provided a count of 
8522 occurrences in 10,000 trials, for an estimated probability of .8522 (close to the combinatorial 
solution of .8530). ■ 
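For comparison, the exact answers quoted above can be recomputed with the counting methods of Section 2.3, using R's built-in choose function for the number of combinations. The variable names are ours, and this is a sketch of the check and not code from the text.

```r
# Exact probabilities for the laptop example by counting
total <- choose(25, 6)     # number of ways to select 6 of the 25 laptops

# Exactly 3 Intel (from 15) and 3 AMD (from 10)
p_exact_3 <- choose(15, 3) * choose(10, 3) / total

# At least 3 Intel: add the cases k = 3, 4, 5, 6 Intel laptops
k <- 3:6
p_at_least_3 <- sum(choose(15, k) * choose(10, 6 - k)) / total

print(round(c(p_exact_3, p_at_least_3), 4))

# Differences between the simulated estimates reported above and the exact values
print(round(c(.3054 - p_exact_3, .8522 - p_at_least_3), 4))
```

Both simulated estimates miss the exact values by only a few thousandths, which is the size of error the next subsection teaches us to expect from 10,000 runs.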


Precision of Simulation 

In Example 2.38, we gave two different estimates P̂(A) for a probability P(A). Which is more 
“correct”? Without knowing P(A) itself, there’s no way to tell. However, thanks to the theory we will 
develop in subsequent chapters, we can quantify the precision of simulated probabilities. Of course, 
we must have written code that faithfully simulates the random experiment of interest. Further, we 
assume that the results of each single run of our program are independent of the results of all other 
runs. (This generally follows from the aforementioned independence of computer-generated random 
numbers.) 


If this is the case, then a measure of the disparity between the true probability P(A) and the 
estimated probability P̂(A) based on n runs of the simulation is given by 


√( P̂(A)[1 − P̂(A)] / n )    (2.9) 


This measure of precision is called the (estimated) standard error of the estimate P̂(A); see Sec- 
tion 3.5 for a derivation. The standard error is analogous to the standard deviation from Chapter 1. 
Expression (2.9) tells us that the amount by which P̂(A) typically differs from P(A) depends upon two 
values: P̂(A) itself and the number of runs n. You can make sense of the former this way: if P(A) is very 
small, then P̂(A) will presumably be small as well, in which case they cannot deviate by very much 
since both are bounded below by zero. (Standard error quantifies the absolute difference between them, 
not the relative difference.) A similar comment applies if P(A) is very large, i.e., near 1. 

As for the relationship to n, Expression (2.9) indicates that the amount by which P̂(A) will 
typically differ from P(A) is inversely proportional to the square root of n. So, in particular, as 
n increases the tendency is for P̂(A) to vary less and less. This speaks to the precision of P̂(A): our 
estimate becomes more precise as n increases, but not at a very fast rate. 

Let’s think a bit more about this relationship: suppose your simulation results thus far were too 
imprecise for your taste. By how much would you have to increase the number of runs to gain one 
additional decimal place of precision? That’s equivalent to reducing the estimated standard error by a 
factor of 10. Since the standard error is proportional to 1/√n, you would need to increase n by a factor of 100 
to achieve the desired improvement; e.g., if using n = 10,000 runs is insufficient for your purposes, 
then you’ll need 1,000,000 runs to get one additional decimal place of precision. Typically, this will 
mean running your program 100 times longer—not a big deal if 10,000 runs only take a nanosecond 
but prohibitive if they require, say, an hour. 


Example 2.41 (Example 2.39 continued) Based on n = 10,000 runs, we estimated the probability of 
coming out ahead in a certain game to be P̂(A) = .1567. Substituting into (2.9), we get 


√( .1567[1 − .1567] / 10,000 ) = .0036 

This is the (estimated) standard error of our estimate .1567. We interpret as follows: some simulation 
experiments with n = 10,000 will result in an estimated probability that is within .0036 of the actual 
probability, whereas other such experiments will give an estimated probability that deviates by more 
than .0036 from the actual P(A); .0036 is roughly the size of a typical deviation between the estimate 
and what it is estimating. ■ 
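Expression (2.9) is simple enough to wrap in a one-line R function. The function name std_error is our own; the sketch reproduces the value from Example 2.41 and then shows what 100 times as many runs would buy, assuming the estimate itself stayed at .1567.

```r
# Estimated standard error of a simulated probability, Expression (2.9)
std_error <- function(p_hat, n) sqrt(p_hat * (1 - p_hat) / n)

se_10k <- std_error(.1567, 10000)      # Example 2.41: n = 10,000 runs
se_1m  <- std_error(.1567, 1000000)    # same estimate, 100 times as many runs

print(round(c(se_10k, se_1m), 5))
print(se_10k / se_1m)                  # factor by which the standard error shrinks
```

Multiplying n by 100 divides the standard error by 10, which is the "one additional decimal place" described before the example.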


In Chapter 8, we will return to the notion of standard error and develop a so-called confidence 
interval estimate for P(A): a range of numbers we can be very certain contains P(A). 


Exercises: Section 2.6 (93–112) 


93. Refer to Example 2.38. 


a. Modify the code in Figure 2.16 to esti- 
mate the probability that exactly one of 
the two devices functions properly. Then 
find the exact probability using the tech- 
niques from earlier sections of this 
chapter, and compare it to your estimated 
probability. 

b. Calculate the estimated standard error 
for the estimated probability in (a). 


94. Imagine you have five independently 
operating components, each working 
properly with probability .8. Use simulation 
to estimate the probability that 


a. All five components work properly. 

b. At least one of the five components 
works properly. 
[Hints for (a) and (b): You can adapt the 
code from Example 2.38, but the and/or 
statements will become tedious. Con- 
sider using the max and min functions 
instead.] 

c. Calculate the estimated standard errors 
for your answers in (a) and (b). 


95. Consider the system depicted in Exercise 
89. Assume the seven components operate 
independently with the following proba- 
bilities of functioning properly: .9 for 
components 1 and 2; .8 for each of com- 
ponents 3, 4, 5, 6; and .95 for component 7. 
Write a program to estimate the reliability 
of the system (i.e., the probability the sys- 
tem functions properly). 


96. You have an opportunity to answer six 
trivia questions about your favorite sports 
team, and you will win a pair of tickets to 
their next game if you can correctly answer 
at least three of the questions. Write a 
simulation program to estimate the chance 
you win the tickets under each of the fol- 
lowing assumptions. 


a. You have a 50-50 chance of getting any 
question right, independent of all others. 

b. Being a true fan, you have a 75% 
chance of getting any question right, 
independent of all others. 

c. The first three questions are fairly easy, 
so you have a .75 chance of getting each 
of those right. However, the last three 
questions are much harder, and you only 
have a .3 probability of correctly answering 
each of those. 


97. In the game “Now or Then” on the televi- 
sion show The Price is Right, the contestant 
faces a wheel with 6 sectors. Each sector 
contains a grocery item and a price, and the 
contestant must decide whether the price is 
“now” (i.e., the item’s price the day of the 
taping) or “then” (the price at some speci- 
fied past date, such as September 2003). 
The contestant wins a prize (bedroom fur- 
niture, a Caribbean cruise, etc.) if he/she 
guesses correctly on three adjacent sectors. 
That is, numbering the sectors 1-6 clockwise,
correct guesses on sectors 5, 6, and 1
win the prize, but correct guesses on
sectors 5, 6, and 3 do not, since the latter
are not all adjacent. (The contestant gets
to guess on all six sectors, if need be.)

Write a simulation program to estimate the 
probability the contestant wins the prize, 
assuming her/his guesses are independent 
from item to item. Provide estimated 
probabilities under each of the following
assumptions: (1) each guess is “wild” and 
thus has probability .5 of being correct, and 
(2) the contestant is a good shopper, with 
probability .8 of being correct on any item. 


Refer to the game in Example 2.39. Under 
the same conditions as in that example, 
estimate the probability the player is ahead 
at any time during the 25 plays. [Hint: This 
occurs if the player’s dollar amount is 
positive at any of the 25 steps in the 
loop. So, you will need to keep track of 
every value of the dollar variable, not just 
the final result.]


Refer again to Example 2.39. Estimate the
probability that the player experiences a 
“swing” of at least $5 during the game. 
That is, estimate the chance that the dif- 
ference between the largest and smallest 
dollar amounts during the game is at least 
5. (This would happen, for instance, if the 
player was at one point ahead at +$2 but
later fell behind to -$3.)


Teresa and Peter each have a fair coin. 
Teresa tosses her coin repeatedly until 
obtaining the sequence HTT. Peter tosses 
his coin until the sequence HTH is 
obtained. 


a. Write a program to simulate Teresa’s 
coin tossing and, separately, Peter’s. 
Your program should keep track of the
number of tosses each person requires
on each simulation run to achieve his or
her target sequence.

b. Estimate the probability that Peter 
obtains his sequence with fewer tosses 
than Teresa requires to obtain her 
sequence. 


A 40-question multiple-choice exam is 
sometimes administered in lower-level 
statistics courses. The exam has a peculiar 
feature: 10 of the questions have two 
options, 13 have three options, 13 have four 
options, and the other 4 have five options. 
(FYI, this is completely real!) What is the 
probability that, purely by guessing, a stu- 
dent could get at least half of these ques- 
tions correct? Write a simulation program 
to answer this question. 


Major League Baseball teams (usually) 
play a 162-game season, during which fans 
are often excited by long winning streaks 
and frustrated by long losing streaks. But 
how unusual are these streaks, really? How 
long a streak would you expect if the 
team’s performance were independent from 
game to game? 

Write a program that simulates a 162-game 
season, i.e., a string of 162 wins and losses, 
with P(win) = p for each game (the value 
of p to be specified later). Use your
program with at least 10,000 runs to answer
the following questions. 


a. Suppose you’re rooting for a “.500”
team (that is, p = .5). What is the
probability of observing a streak of at 
least five wins in a 162-game season? 
Estimate this probability with your 
program, and include a standard error. 

b. Suppose instead your team is quite 
good: a .600 team overall, so p = .6. 
Intuitively, should the probability of a 
winning streak of at least five games be 
higher or lower? Explain. 

c. Use your program with p = .6 to esti- 
mate the probability alluded to in (b). Is 
your answer higher or lower than (a)? Is 
that what you anticipated? 


A derangement of the numbers 1 through
n is a permutation of all n of those numbers
such that none of them is in the “right
place.” For example, 34251 is a derangement
of 1 through 5, but 24351 is not
because 3 is in the 3rd position. We will use
simulation to estimate the number of
derangements of the numbers 1 through 12.


a. Write a program that generates random 
permutations of the integers 1, 2, ..., 12. 
Your program should determine whe- 
ther or not each permutation is a 
derangement. 

b. Based on your program, estimate P(D), 
where D = {a permutation of 1-12 is a 
derangement}. 

c. From Section 2.3, we know the number 
of permutations of n items. (How many 
is that for n = 12?) Use this information 
and your answer to part (b) to estimate 
the number of derangements of the 
numbers 1 through 12.

[Hint for (a): Use random sampling 
without replacement as in Example 
2.40.] 
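
The hint can be followed along these lines. This is a minimal sketch in Python rather than R, and the helper names, seed, and run count are assumptions made for illustration. Sampling all 12 integers without replacement produces a random permutation, and a permutation is a derangement when no value equals its own position.

```python
import math
import random

def is_derangement(perm):
    """True if no number sits in its 'right place' (positions count from 1)."""
    return all(value != position for position, value in enumerate(perm, start=1))

def estimate_derangement_prob(n=12, n_runs=10_000, seed=7):
    rng = random.Random(seed)  # seeded only for reproducibility
    hits = 0
    for _ in range(n_runs):
        # Sampling n items from 1..n without replacement gives a random permutation
        perm = rng.sample(range(1, n + 1), n)
        hits += is_derangement(perm)
    return hits / n_runs

p_hat = estimate_derangement_prob()
# Part (c): scale the estimated probability by the number of permutations, 12!
print("estimated P(D):", p_hat)
print("estimated number of derangements:", p_hat * math.factorial(12))
```

The two permutations quoted in the exercise, 34251 and 24351, make a quick sanity check for is_derangement before running the full simulation.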


The famous Birthday Problem was pre- 
sented in Example 2.22. Now suppose you 
have 500 Facebook friends. Make the same 
assumptions here as in the Birthday 
Problem.


a. Write a program to estimate the probability
that, on at least one day during the
year, Facebook tells you three (or more) 
of your friends share that birthday. 
Based on your answer, should you be 
surprised by this occurrence? 

b. Write a program to estimate the probability
that, on at least one day during the
year, Facebook tells you five (or more) 
of your friends share that birthday. 
Based on your answer, should you be 
surprised by this occurrence? 
[Hint: Generate 500 birthdays with 
replacement, and then determine whe- 
ther any birthday occurs three or more 
times (five or more for part (b)). The 
table function in R may prove useful.] 
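
For readers working outside R, the same idea can be sketched in Python, where collections.Counter plays a role similar to R's table function by tallying how often each birthday occurs. The function name, the seed, and the default of 1000 runs are assumptions for this sketch; days are simply labeled 0 through 364.

```python
import random
from collections import Counter

def shared_birthday_prob(k, n_friends=500, n_days=365, n_runs=1_000, seed=11):
    """Estimate P(some day is the birthday of at least k of the friends)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        # Generate birthdays with replacement, each day equally likely
        birthdays = [rng.randrange(n_days) for _ in range(n_friends)]
        counts = Counter(birthdays)  # tally of friends per day
        if max(counts.values()) >= k:
            hits += 1
    return hits / n_runs

p3 = shared_birthday_prob(3)  # part (a)
p5 = shared_birthday_prob(5)  # part (b)
print("three or more share a day:", p3)
print("five or more share a day: ", p5)
```

Only the largest count matters for each run, so the tally is reduced to its maximum before comparing with k.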


Consider the following game: you begin 
with $20. You flip a fair coin, winning $10 
if the coin lands heads and losing $10 if the 
coin lands tails. Play continues until you 
either go broke or have $100 (i.e., a net 
profit of $80). Write a simulation program 
to estimate: 


a. The probability you win the game. 

b. The probability the game ends within 
ten coin flips. 
[Note: This is a special case of the 
Gambler’s Ruin problem.] 
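
A simulation of this game might be organized as below. This is a sketch in Python under the exercise's stated rules; the function names, seed, and number of runs are illustrative assumptions. Each run plays until the bankroll reaches $0 or $100 and records both the outcome and the number of flips, so parts (a) and (b) come from the same set of runs.

```python
import random

def play_game(rng, start=20, stake=10, target=100):
    """Play one game; return (won, number_of_flips)."""
    money, flips = start, 0
    while 0 < money < target:
        # Heads wins the stake, tails loses it; the coin is fair
        money += stake if rng.random() < 0.5 else -stake
        flips += 1
    return money >= target, flips

def estimate(n_runs=10_000, seed=3):
    rng = random.Random(seed)
    wins = 0
    short_games = 0
    for _ in range(n_runs):
        won, flips = play_game(rng)
        wins += won
        short_games += flips <= 10  # game ended within ten coin flips
    return wins / n_runs, short_games / n_runs

p_win, p_short = estimate()
print("P(win) estimate:", p_win)
print("P(game ends within 10 flips) estimate:", p_short)
```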


Consider the Coupon Collector’s Problem: 
10 different coupons are distributed into 
cereal boxes, one per box, so that any 
randomly selected box is equally likely to 
have any of the 10 coupons inside. Write a 
program to simulate the process of buying 
cereal boxes until all 10 distinct coupons 
have been collected. For each run, keep 
track of how many cereal boxes you pur- 
chased to collect the complete set of cou- 
pons. Then use your program to answer the 
following questions. 


a. What is the probability you collect all 
10 coupons with just 10 cereal boxes? 

b. Use counting techniques to determine 
the exact probability in (a). [Hint: Relate 
this to the Birthday Problem.]

c. What is the probability you require
more than 20 boxes to collect all 10 
coupons? 

d. Using techniques from Chapters 3 and 5, 
it can be shown that it takes about 29.3 
boxes, on the average, to collect all 10 
coupons. What’s the probability of col- 
lecting all 10 coupons in fewer than 
average boxes (i.e., less than 29.3)? 


Consider the following famous puzzle from 
the early days of probability, investigated 
by Pascal and Fermat. Which of the fol- 
lowing events is more likely: to roll at least 
one 6 in four rolls of a fair die, or to roll at 
least one double-6 in 24 rolls of two fair 
dice? 


a. Write a program to simulate a set of four 
die rolls many times, and use the results 
to estimate P(at least one 6 in 4 rolls). 

b. Now adapt your program to simulate 
rolling a pair of dice 24 times. Repeat 
this simulation many times, and use 
your results to estimate P(at least one 
double-6 in 24 rolls). 


The Problem of the Points. Pascal and 
Fermat also explored a question concern- 
ing how to divide the stakes in a game that 
has been interrupted. Suppose two players, 
Blaise and Pierre, are playing a game 
where the winner is the first to achieve a 
certain number of points. The game gets 
interrupted at a moment when Blaise needs 
n more points to win and Pierre needs 
m more to win. How should the game’s 
prize money be divvied up? Fermat argued 
that Blaise should receive a proportion of 
the total stake equal to the chance he 
would have won if the game hadn’t been 
interrupted (and Pierre the 
remainder). 

Assume the game is played in rounds, 
the winner of each round gets 1 point,
rounds are independent, and the two 
players are equally likely to win any 
particular round.


a. Write a program to simulate the rounds
of the game that would have happened
after play was interrupted. A single
simulation run should terminate as soon
as Blaise has n wins or Pierre has m wins
(equivalently, Blaise has m losses). Use
your program to estimate
P(Blaise gets 10 wins before 15 losses),
which is the proportion of the total stake
Blaise should receive if n = 10 and
m = 15.

b. Use your same program to estimate the
relevant probability when n = m = 10.
Logically, what should the answer be?
Is your estimated probability close to
that?

c. Finally, let’s assume Pierre is actually
the better player: P(Blaise wins a
round) = .4. Again with n = 10 and
m = 15, what proportion of the stake
should be awarded to Blaise?


Twenty faculty members in a certain 
department have just participated in a 
department chair election. Suppose that 
candidate A has received 12 of the votes 
and candidate B the other 8 votes. If the 
ballots are opened one by one in random 
order and the candidate selected on each 
ballot is recorded, use simulation to esti- 
mate the probability that candidate A 
remains ahead of candidate B throughout 
the vote count (which happens if, for 
example, the result is AA...AB...B but not 
if the result is AABABB...). 
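
One way to set up this simulation is sketched below in Python; the function names, seed, and run count are assumptions for illustration. Shuffling a list of 12 A's and 8 B's gives a random opening order, and A "remains ahead" only if A's running lead is strictly positive after every ballot, which rules out ties such as the one reached after AABABB.

```python
import random

def a_always_ahead(ballots):
    """True if A's running lead is strictly positive after every ballot."""
    lead = 0
    for b in ballots:
        lead += 1 if b == "A" else -1
        if lead <= 0:  # tied or behind at this point in the count
            return False
    return True

def estimate_ballot_prob(n_a=12, n_b=8, n_runs=10_000, seed=9):
    rng = random.Random(seed)
    ballots = ["A"] * n_a + ["B"] * n_b
    hits = 0
    for _ in range(n_runs):
        rng.shuffle(ballots)  # random order in which ballots are opened
        hits += a_always_ahead(ballots)
    return hits / n_runs

p_ahead = estimate_ballot_prob()
print("P(A ahead throughout) estimate:", p_ahead)
```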


Show that the (estimated) standard error for 
P̂(A) is at most 1/√(4n).


Simulation can be used to estimate numerical
constants, such as π. Here’s one
approach: consider the part of a disk of
radius 1 that lies in the first quadrant (a
quarter-circle). Imagine two random numbers,
x and y, both between 0 and 1. The
pair (x, y) lies somewhere in the first 
quadrant; let A denote the event that (x, y) 
falls inside the quarter-circle.


a. Write a program that simulates pairs
(x, y) in order to estimate P(A), the
probability that a randomly selected point
in the square [0, 1] × [0, 1]
lies in the quarter-circle of radius 1.

b. Using techniques from Chapter 5, it can
be shown that the exact probability of
A is π/4 (which makes sense, because
that’s the ratio of the quarter-circle’s
area to the square’s area). Use that fact
to come up with an estimate of π from
your simulation. How close is your
estimate to 3.14159...?
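
The approach just described can be sketched as follows in Python; the function name, seed, and run count are assumptions for this illustration. A point (x, y) lies inside the quarter-circle when x² + y² ≤ 1, and since P(A) = π/4, multiplying the estimated probability by 4 gives an estimate of π.

```python
import random

def estimate_pi(n_runs=100_000, seed=5):
    """Estimate pi from the fraction of random points inside the quarter-circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_runs):
        x, y = rng.random(), rng.random()  # random point in the unit square
        if x * x + y * y <= 1:             # point falls inside the quarter-circle
            inside += 1
    p_hat = inside / n_runs                # estimate of P(A) = pi / 4
    return 4 * p_hat

pi_hat = estimate_pi()
print("estimate of pi:", pi_hat)
```

The estimated standard error of 4·P̂(A) is four times that of P̂(A), so even 100,000 runs typically pin down only the first two or three digits.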


Consider the quadratic equation ax² + bx +
c = 0. Suppose that a, b, and c are random
numbers between 0 and 1 (like those produced
by an RNG). Estimate the probability
that the roots of this quadratic equation are 
real. [Hint: Think about the discriminant.] 
This probability can be computed exactly 
using methods from Chapter 5, but a triple 
integral is required.


The undergraduate statistics club at a certain
university has 24 members.


a. All 24 members of the club are eligible 
to attend a conference next week, but 
they can only afford to send 4 people. In 
how many possible ways could 4 
attendees be selected? 

b. All club members are eligible for any of 
the four positions of president, VP, 
secretary, or treasurer. In how many 
possible ways can these positions be 
occupied? 

c. Suppose it’s agreed that two people will 
be cochairs, one person secretary, and 
one person treasurer. How many ways 
are there to fill these positions now? 


A small manufacturing company will start 
operating a night shift. There are 20 
machinists employed by the company.


a. If a night crew consists of 3 machinists,
how many different crews are possible? 

b. If the machinists are ranked 1, 2, ..., 20 
in order of competence, how many of 
these crews would not have the best 
machinist? 

c. How many of the crews would have at 
least 1 of the 10 best machinists? 

d. If a 3-person crew is selected at random 
to work on a particular night, what is the 
probability that the best machinist will 
not work that night? 


A factory uses three production lines to 
manufacture cans of a certain type. The 
accompanying table gives percentages of 
nonconforming cans, categorized by type of 
nonconformance, for each of the three lines 
during a particular time period. 


                    Line 1   Line 2   Line 3

Blemish               15       12       20
Crack                 50       44       40
Pull Tab Problem      21       28       24
Surface Defect        10        8       15
Other                  4        8        2

During this period, line 1 produced 500
nonconforming cans, line 2 produced 400 
such cans, and line 3 was responsible for 
600 nonconforming cans. Suppose that one 
of these 1500 cans is randomly selected. 


a. What is the probability that the can was 
produced by line 1? That the reason for 
nonconformance is a crack? 

b. If the selected can came from line 1,
what is the probability that it had a 
blemish? 

c. Given that the selected can had a surface 
defect, what is the probability that it 
came from line 1? 


An employee of the records office at a 
university currently has ten forms on his 
desk awaiting processing. Six of these are 
withdrawal petitions, and the other four are 
course substitution requests.


a. If he randomly selects six of these forms
to give to a subordinate, what is the 
probability that only one of the two 
types of forms remains on his desk? 

b. Suppose he has time to process only 
four of these forms before leaving for 
the day. If these four are randomly 
selected one by one, what is the proba- 
bility that each succeeding form is of a 
different type from its predecessor? 


One satellite is scheduled to be launched 
from Cape Canaveral in Florida, and 
another launching is scheduled for Van- 
denberg Air Force Base in California. Let 
A denote the event that the Vandenberg 
launch goes off on schedule, and let B rep- 
resent the event that the Cape Canaveral 
launch goes off on schedule. If A and B are 
independent events with P(A) > P(B) and 
P(A ∪ B) = .626, P(A ∩ B) = .144, determine
the values of P(A) and P(B).


A transmitter is sending a message by using 
a binary code, namely a sequence of 0’s 
and 1’s. Each transmitted bit (0 or 1) must 
pass through three relays to reach the 
receiver. At each relay, the probability is .2 
that the bit sent will be different from the bit 
received (a reversal). Assume that the 
relays operate independently of one 
another. 


Transmitter → Relay 1 → Relay 2 → Relay 3 → Receiver


a. If a 1 is sent from the transmitter, what 
is the probability that a 1 is sent by all 
three relays? 

b. If a 1 is sent from the transmitter, what is
the probability that a 1 is received by the 
receiver? [Hint: The eight experimental 
outcomes can be displayed on a tree 
diagram with three generations of branches,
one generation for each relay.]

c. Suppose 70% of all bits sent from the
transmitter are 1’s. If a 1 is received by
the receiver, what is the probability that
a 1 was sent?


Individual A has a circle of five close 
friends (B, C, D, E, and F). A has heard a 
certain rumor from outside the circle and 
has invited the five friends to a party to 
circulate the rumor. To begin, A selects one 
of the five at random and tells the rumor to 
the chosen individual. That individual then 
selects at random one of the four remaining 
individuals and repeats the rumor. Contin- 
uing, a new individual is selected from 
those not already having heard the rumor 
by the individual who has just heard it, until 
everyone has been told. 


a. What is the probability that the rumor is 
repeated in the order B, C, D, E, and F? 

b. What is the probability that F is the third 
person at the party to be told the rumor? 

c. What is the probability that F is the last 
person to hear the rumor? 


Refer to the previous exercise. If at each 
stage the person who currently “has” the 
rumor does not know who has already 
heard it and selects the next recipient at 
random from all five possible individuals, 
what is the probability that F has still not 
heard the rumor after it has been told ten 
times at the party? 


A chemist is interested in determining 
whether a certain trace impurity is present 
in a product. An experiment has a proba- 
bility of .80 of detecting the impurity if it is 
present. The probability of not detecting the 
impurity if it is absent is .90. The prior 
probabilities of the impurity being present 
and being absent are .40 and .60, respec- 
tively. Three separate experiments result in 
only two detections. What is the posterior 
probability that the impurity is present? 


Fasteners used in aircraft manufacturing are 
slightly crimped so that they lock enough to 
avoid loosening during vibration. Suppose 
that 95% of all fasteners pass an initial
inspection. Of the 5% that fail, 20% are so
seriously defective that they must be 
scrapped. The remaining fasteners are sent 
to a re-crimping operation, where 40% 
cannot be salvaged and are discarded. The 
other 60% of these fasteners are corrected 
by the re-crimping process and subse- 
quently pass inspection. 


a. What is the probability that a randomly 
selected incoming fastener will pass 
inspection either initially or after re- 
crimping? 

b. Given that a fastener passed inspection, 
what is the probability that it passed the 
initial inspection and did not need
re-crimping?


One percent of all individuals in a certain
population are carriers of a particular disease.
A diagnostic test for this disease has a
90% detection rate for carriers and a 5%
detection rate for noncarriers. Suppose the
test is applied independently to two different
blood samples from the same randomly
selected individual.


a. What is the probability that both tests 
yield the same result? 

b. If both tests are positive, what is the 
probability that the selected individual 
is a carrier?


A system consists of two components. The
probability that the second component
functions in a satisfactory manner during its
design life is .9, the probability that at least
one of the two components does so is .96,
and the probability that both components
do so is .75. Given that the first component
functions in a satisfactory manner
throughout its design life, what is the
probability that the second one does also?


A certain company sends 40% of its overnight
mail parcels via express mail service
E1. Of these parcels, 2% arrive after the
guaranteed delivery time (denote the event
“late delivery” by L). If a record of an
overnight mailing is randomly selected
from the company’s file, what is the
probability that the parcel went via E1 and
was late?


Refer to the previous exercise. Suppose that 
50% of the overnight parcels are sent via 
express mail service E2 and the remaining
10% are sent via E3. Of those sent via E2,
only 1% arrive late, whereas 5% of the
parcels handled by E3 arrive late.


a. What is the probability that a randomly 
selected parcel arrived late? 

b. If a randomly selected parcel has arrived
on time, what is the probability that it
was not sent via E1?


A company uses three different assembly
lines (A1, A2, and A3) to manufacture a
particular component. Of those manufactured
by line A1, 5% need rework to remedy
a defect, whereas 8% of A2’s components
need rework and 10% of A3’s need rework.
Suppose that 50% of all components are
produced by line A1, 30% are produced by
line A2, and 20% come from line A3. If a
randomly selected component needs
rework, what is the probability that it came
from line A1? From line A2? From line A3?


Disregarding the possibility of a February 
29 birthday, suppose a randomly selected 
individual is equally likely to have been 
born on any one of the other 365 days. If 
ten people are randomly selected, what is 
the probability that either at least two have 
the same birthday or at least two have the 
same last three digits of their Social Secu- 
rity numbers? [Note: The article “Methods 
for Studying Coincidences” (F. Mosteller 
and P. Diaconis, J. Amer. Statist. Assoc. 
1989: 853-861) discusses problems of this
type.]


One method used to distinguish between
granitic (G) and basaltic (B) rocks is to 
examine a portion of the infrared spectrum 
of the sun’s energy reflected from the rock 
surface. Let R1, R2, and R3 denote measured
spectrum intensities at three different
wavelengths; typically, for granite R1 <
R2 < R3, whereas for basalt R3 < R1 < R2.
When measurements are made remotely 
(using aircraft), various orderings of the 
R_i’s may arise whether the rock is basalt or
granite. Flights over regions of known 
composition have yielded the following 
information: 


                   Granite   Basalt
R1 < R2 < R3         60%       10%
R1 < R3 < R2         25%       20%
R3 < R1 < R2         15%       70%


Suppose that for a randomly selected rock 
specimen in a certain region, P(granite) = 
.25 and P(basalt) = .75. 

a. Show that P(granite | R1 < R2 < R3)
> P(basalt | R1 < R2 < R3). If measurements
yielded R1 < R2 < R3, would you
classify the rock as granite or basalt?

b. If measurements yielded R1 < R3 < R2,
how would you classify the rock?
Answer the same question for
R3 < R1 < R2.

c. Using the classification rules indicated 
in parts (a) and (b), when selecting a 
rock from this region, what is the 
probability of an erroneous classifica- 
tion? [Hint: Either G could be classified 
as B or B as G, and P(B) and P(G) are 
known.]

d. If P(granite) = p rather than .25, are 
there values of p (other than 1) for 
which a rock would always be classified 
as granite? 


130. In a Little League baseball game, team A’s
pitcher throws a strike 50% of the time and
a ball 50% of the time, successive pitches 
are independent of each other, and the 
pitcher never hits a batter. Knowing this, 
team B’s manager has instructed the first 
batter not to swing at anything. Calculate 
the probability that 


a. The batter walks on the fourth pitch. 

b. The batter walks on the sixth pitch (so 
two of the first five must be strikes), 
using a counting argument or constructing
a tree diagram.

c. The batter walks.

d. The first batter up scores while no one is 
out (assuming that each batter pursues a 
no-swing strategy). 


The Matching Problem. Four friends— 
Allison, Beth, Carol, and Diane—who have 
identical calculators are studying for a 
statistics exam. They set their calculators 
down in a pile before taking a study break 
and then pick them up in random order 
when they return from the break. 


a. What is the probability all four friends 
pick up the correct calculator? 

b. What is the probability that at least one 
of the four gets her own calculator? 
[Hint: Let A be the event that Allison gets
her own calculator, and define events B,
C, and D analogously for the other three
students. How can the event {at least
one gets her own calculator} be
expressed in terms of the four events A,
B, C, and D? Now use a general law of
probability.]

c. Generalize the answer from part (b) to 
n individuals. Can you recognize the 
result when n is large (the approxima- 
tion to the resulting series)? 


A particular airline has 10 a.m. flights from 
Chicago to New York, Atlanta, and Los 
Angeles. Let A denote the event that the New 
York flight is full and define events B and 
C analogously for the other two flights. 
Suppose P(A) = .6, P(B) = .5, P(C) = .4,
and the three events are independent. What is 
the probability that 


a. All three flights are full? That at least 
one flight is not full? 

b. Only the New York flight is full? That 
exactly one of the three flights is full? 


The Secretary Problem. A personnel man- 
ager is to interview four candidates for a job. 
These are ranked 1, 2, 3, and 4 in order of 
preference and will be interviewed in ran- 
dom order. However, at the conclusion of 
each interview, the manager will know only 
how the current candidate compares to those
previously interviewed. For example, the
interview order 3, 4, 1, 2 generates no 
information after the first interview and 
shows that the second candidate is worse 
than the first, and that the third is better than 
the first two. However, the order 3, 4, 2, 1 
would generate the same information after 
each of the first three interviews. The man- 
ager wants to hire the best candidate but must 
make an irrevocable hire/no hire decision 
after each interview. Consider the following 
strategy: Automatically reject the first s can- 
didates, and then hire the first subsequent 
candidate who is best among those already 
interviewed (if no such candidate appears, 
the last one interviewed is hired). 

For example, with s = 2, the order 3, 4, 1, 2 
would result in the best being hired, 
whereas the order 3, 1, 2, 4 would not. Of 
the four possible s values (0, 1, 2, and 3), 
which one maximizes P(best is hired)? 
[Hint: Write out the 24 equally likely 
interview orderings; s = 0 means that the 
first candidate is automatically hired.] 


Consider four independent events A1, A2,
A3, and A4, and let p_i = P(A_i) for i = 1, 2, 3,
4. Express the probability that at least one
of these four events occurs in terms of the
p_i’s, and do the same for the probability that
at least two of the events occur.


A box contains the following four slips of 
paper, each having exactly the same 
dimensions: (1) win prize 1; (2) win prize 2; 
(3) win prize 3; (4) win prizes 1, 2, and 3. 
One slip will be randomly selected. Let 
A1 = {win prize 1}, A2 = {win prize 2}, and
A3 = {win prize 3}. Show that A1 and A2 are
independent, that A1 and A3 are independent,
and that A2 and A3 are also independent
(this is pairwise independence).
However, show that P(A1 ∩ A2 ∩ A3) ≠
P(A1) · P(A2) · P(A3), so the three events
are not mutually independent.


Consider a woman whose brother is afflicted
with hemophilia, which implies that the
woman’s mother has the hemophilia gene
on one of her two X chromosomes (almost
surely not both, since that is generally 
fatal). Thus there is a 50-50 chance that the 
woman’s mother has passed on the bad 
gene to her. The woman has two sons, each 
of whom will independently inherit the 
gene from one of her two chromosomes. If 
the woman herself has a bad gene, there is a 
50-50 chance she will pass this on to a son. 
Suppose that neither of her two sons is 
afflicted with hemophilia. What then is the 
probability that the woman is indeed the 
carrier of the hemophilia gene? What is this 
probability if she has a third son who is also 
not afflicted? 


Jurors may be a priori biased for or against 
the prosecution in a criminal trial. Each juror 
is questioned by both the prosecution and 
the defense (the voir dire process), but this 
may not reveal bias. Even if bias is revealed, 
the judge may not excuse the juror for cause 
because of the narrow legal definition of 
bias. For a randomly selected candidate for 
the jury, define events B0, B1, and B2 as the
juror being unbiased, biased against the
prosecution, and biased against the defense,
respectively. Also let C be the event that bias
is revealed during the questioning and D be
the event that the juror is eliminated
for cause. Let b_i = P(B_i) (i = 0, 1, 2), c =
P(C | B1) = P(C | B2), and d = P(D | B1 ∩ C)
= P(D | B2 ∩ C) [“Fair Number of Peremptory
Challenges in Jury Trials,” J. Amer. Statist. 
Assoc. 1979: 747-753]. 


a. If a juror survives the voir dire process, 
what is the probability that he/she is 
unbiased (in terms of the b_i’s, c, and d)?
What is the probability that he/she is
biased against the prosecution? What is
the probability that he/she is biased
against the defense? [Hint: Represent
this situation using a tree diagram with
three generations of branches.]

b. What are the probabilities requested in
(a) if b0 = .50, b1 = .10, b2 = .40 (all
based on data relating to the famous trial
of the Florida murderer Ted Bundy),
c = .85 (corresponding to the extensive
questioning appropriate in a capital
case), and d = .7 (a “moderate” judge)?


Gambler’s Ruin. Allan and Beth currently
have $2 and $3, respectively. A fair coin is 
tossed. If the result of the toss is H, Allan 
wins $1 from Beth, whereas if the coin toss 
results in T, then Beth wins $1 from Allan. 
This process is then repeated, with a coin toss 
followed by the exchange of $1, until one of 
the two players goes broke (one of the two 
gamblers is ruined). We wish to determine 


a2 = P(Allan is the winner | he starts with $2)


To do so, let’s also consider probabilities


a_i = P(Allan wins | he starts with $i) for
i = 0, 1, 3, 4, and 5.


a. What are the values of a0 and a5?

b. Use the Law of Total Probability to
obtain an equation relating a2 to a1 and
a3. [Hint: Condition on the result of the
first coin toss, realizing that if it is an H,
then from that point Allan starts
with $3.]

c. Using the logic described in (b),
develop a system of equations relating
a_i (i = 1, 2, 3, 4) to a_{i-1} and a_{i+1}. Then
solve these equations. [Hint: Write each
equation so that a_i - a_{i-1} is on the left-hand
side. Then use the result of the first
equation to express each other a_i - a_{i-1}
as a function of a1, and add together
all four of these expressions (i = 2, 3,
4, 5).]

d. Generalize the result to the situation in
which Allan’s initial fortune is $a and
Beth’s is $b. [Note: The solution is a bit
more complicated if p = P(Allan wins
$1) ≠ .5.]


An event A is said to attract event B if 
P(B | A) > P(B) and repel B if P(B | A) < 
P(B). (This refines the notion of dependent
events by specifying whether A makes
B more likely or less likely to occur.) 


a. Show that if A attracts B, then A repels B’. 

b. Show that if A attracts B, then A′ repels B.

c. Prove the Law of Mutual Attraction: 
event A attracts event B if, and only if, 
B attracts A. 


A fair coin is tossed repeatedly until either 
the sequence TTH or the sequence THT is 
observed. Let B be the event that stopping 
occurs because TTH was observed (i.e.,
that TTH is observed before THT). Calculate
P(B). [Hint: Consider the following
partition of the sample space: A1 = {1st
toss is H}, A2 = {1st two tosses are TT},
A3 = {1st three tosses are THT}, and
A4 = {1st three tosses are THH}. Also
denote P(B) by p. Apply the Law of Total 
Probability, and p will appear on both sides 
in various places. The resulting equation is 
easily solved for p.]


Introduction

Suppose a city’s traffic engineering department monitors a certain intersection during a one-hour 
period in the middle of the day. Many characteristics might be of interest: the number of vehicles that 
enter the intersection, the largest number of vehicles in the left turn lane during a signal cycle, the 
speed of the fastest vehicle going through the intersection, the average speed x̄ of all vehicles entering
the intersection. The value of each one of the foregoing variable quantities is subject to uncertainty— 
we don’t know a priori how many vehicles will enter, what the maximum speed will be, etc. So each 
of these is referred to as a random variable—a variable quantity whose value is determined by what 
happens in a chance experiment. 

The most commonly encountered random variables are one of two fundamentally different types: 
discrete random variables and continuous random variables. In this chapter, we examine the basic 
properties and discuss the most important examples of discrete variables. Chapter 4 focuses on 
continuous random variables. 


3.1 Random Variables 


In any experiment, numerous characteristics can be observed or measured, but in most cases an 
experimenter will focus on some specific aspect or aspects of a sample. For example, in a study of 
commuting patterns in a metropolitan area, each individual in a sample might be asked about 
commuting distance and the number of people commuting in the same vehicle, but not about IQ, 
income, family size, and other such characteristics. Alternatively, a researcher may test a sample of 
components and record only the number that have failed within 1000 h, rather than record the 
individual failure times. 

In general, each outcome of an experiment can be associated with a number by specifying a rule of 
association, e.g., the number among the sample of ten components that fail to last 1000 h, or the total 
baggage weight for a sample of 25 airline passengers. Such a rule of association is called a random 
variable—a variable because different numerical values are possible and random because the 
observed value depends on which of the possible experimental outcomes results (Figure 3.1). 
Figure 3.1 A random variable


DEFINITION For a given sample space 𝒮 of some experiment, a random variable (rv)
is any rule that associates a number with each outcome in 𝒮. In mathematical
language, a random variable is a function whose domain is the sample space
and whose range is some subset of real numbers.


Random variables are customarily denoted by uppercase letters, such as X and Y, near the end of our 
alphabet. In contrast to our previous use of a lowercase letter, such as x, to denote a variable, we will 
now use lowercase letters to represent some particular value of the corresponding random variable. 
The notation X(s) = x means that x is the value associated with the outcome s by the rv X. 


Example 3.1 When a student attempts to connect to a university’s WIFI network, either there is a
failure (F) or there is a success (S). With 𝒮 = {S, F}, define a rv X by X(S) = 1, X(F) = 0. The rv X indicates
whether (1) or not (0) the student can connect. ■


In Example 3.1, the rv X was specified by explicitly listing each element of 𝒮 and the associated
number. If 𝒮 contains more than a few outcomes, such a listing is tedious, but it can frequently be
avoided. 


Example 3.2 Consider the experiment in which a telephone number is dialed using a random 
number dialer (such devices are used extensively by polling organizations), and define a rv Y by 


$$Y = \begin{cases} 1 & \text{if the selected number is on the National Do Not Call Registry} \\ 0 & \text{if the selected number is not on the registry} \end{cases}$$


For example, if 916-528-2966 appears on the national registry, then Y(916-528-2966) = 1, whereas 
Y(213-772-7350) = 0 tells us that the number 213-772-7350 is not on the registry. A word description 
of this sort is more economical than a complete listing, so we will use such a description whenever 
possible. ■


In Examples 3.1 and 3.2, the only possible values of the random variable were 0 and 1. Such a 
random variable arises frequently enough to be given a special name, after the individual who first 
studied it. 


DEFINITION Any random variable whose only possible values are 0 and 1 is called a
Bernoulli random variable. 


We will often want to define and study several different random variables from the same sample 
space. 
Example 3.3 Example 2.3 described an experiment in which the number of pumps in use at each of
two gas stations was determined. Define rvs X, Y, and U by


X = the total number of pumps in use at the two stations
Y = the difference between the number of pumps in use at station 1 and the number in use at station 2
U = the maximum of the numbers of pumps in use at the two stations


If this experiment is performed and s = (2, 3) results, then X((2, 3)) = 2 + 3 = 5, so we say that the
observed value of X is x = 5. Similarly, the observed value of Y would be y = 2 − 3 = −1, and the
observed value of U would be u = max(2, 3) = 3. ■
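
Because a random variable is just a function on the sample space, the three rvs of Example 3.3 can be sketched directly in code. The following Python sketch is illustrative only; it assumes (the example itself does not restate this) that each station has six pumps, so an outcome is a pair of counts from 0 to 6. The names `sample_space`, `X`, `Y`, and `U` are chosen here for the illustration.

```python
from itertools import product

# ASSUMPTION: each of the two stations has six pumps, so an outcome s is a
# pair (pumps in use at station 1, pumps in use at station 2), each 0..6.
PUMPS_PER_STATION = 6
sample_space = list(product(range(PUMPS_PER_STATION + 1), repeat=2))

def X(s):
    """Total number of pumps in use at the two stations."""
    return s[0] + s[1]

def Y(s):
    """Difference: pumps in use at station 1 minus pumps in use at station 2."""
    return s[0] - s[1]

def U(s):
    """Maximum of the numbers of pumps in use at the two stations."""
    return max(s)

s = (2, 3)
print(X(s), Y(s), U(s))  # observed values x, y, u for the outcome (2, 3)

# The set of possible values of an rv is the image of the sample space.
x_values = sorted({X(s) for s in sample_space})
y_values = sorted({Y(s) for s in sample_space})
u_values = sorted({U(s) for s in sample_space})
```

Several different random variables are defined on the same sample space, and each has its own set of possible values.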


Each of the random variables of Examples 3.1–3.3 can assume only a finite number of possible
values. This need not be the case. 


Example 3.4 Consider any general inspection process, wherein items are examined one by one until 
we find an item that falls within required specification limits. The sample space of such an experiment 
is 𝒮 = {S, FS, FFS, ...}. Define a rv X by


X = the number of items examined until a "good" one is found


Then X(S) = 1, X(FS) = 2, X(FFS) = 3, ..., X(FFFFFFS) = 7, and so on. Any positive integer is a
possible value of X, so the set of possible values is infinite. ■


Example 3.5 Suppose that in some random fashion, a location (latitude and longitude) in the 
continental USA is selected. Define a rv Y by 


Y = the height above sea level at the selected location 


For example, if the selected location were (39° 50′ N, 98° 35′ W), then it might be the case that
Y((39° 50′ N, 98° 35′ W)) = 1748.26 ft. The largest possible value of Y is 14,494 (Mt. Whitney), and
the smallest possible value is −282 (Death Valley). The set of all possible values of Y is the set of all
numbers in the interval between −282 and 14,494, that is,


{y: y is a number, −282 ≤ y ≤ 14,494} = [−282, 14,494]
and there are infinitely many numbers in this interval (an entire continuum). ■


Two Types of Random Variables 

In Section 1.2 we distinguished between data resulting from observations on a counting variable and 
data obtained by observing values of a measurement variable. A slightly more formal distinction 
characterizes two different types of random variables. 


DEFINITION A discrete random variable is a rv whose possible values constitute either 
a finite set or a countably infinite set.¹


¹For those unfamiliar with the term, a countably infinite set is one for which the elements can be enumerated: a first
element, a second element, and so on. The set of all positive integers and the set of all integers are both countably 
infinite, but an interval like [2, 5] on the number line is not. 
A random variable is continuous if both of the following apply:


i. Its set of possible values consists either of all numbers in a single interval on the
number line (possibly infinite in extent, e.g., from −∞ to ∞) or all numbers in a
disjoint union of such intervals (e.g., [0, 10] ∪ [20, 30]).

ii. No possible value of the variable has positive probability, that is, P(X = c) = 0 for any
possible value c.


Although any interval on the number line contains infinitely many numbers, it can be shown that there 
is no way to create a listing of all these values—there are just too many of them. The second condition 
describing a continuous random variable is perhaps counterintuitive, since it would seem to imply a 
total probability of zero for all possible values. But we shall see in Chapter 4 that intervals of values 
have positive probability; the probability of an interval will decrease to zero as the width of the 
interval shrinks to zero. 


Example 3.6 All random variables in Examples 3.1–3.4 are discrete. As another example, suppose
we select married couples at random and do a blood test on each person until we find a pair of spouses
who have the same Rh factor. With X = the number of blood tests to be performed, possible values of
X are 2, 4, 6, 8, .... Since the possible values have been listed in sequence, X is a discrete rv. ■


To study basic properties of discrete rvs, only the tools of discrete mathematics—summation and 
differences—are required. The study of continuous variables requires the continuous mathematics of 
the calculus—integrals and derivatives. 


Exercises: Section 3.1 (1-10) 


1. A concrete beam may fail either by shear (S) or flexure (F). Suppose that three failed
beams are randomly selected and the type of failure is determined for each one. Let
X = the number of beams among the three selected that failed by shear. List each
outcome in the sample space along with the associated value of X.

2. Give three examples of Bernoulli rvs (other than those in the text).

3. Using the experiment in Example 3.3, define two more random variables and list
the possible values of each.

4. Let X = the number of nonzero digits in a randomly selected zip code. What are the
possible values of X? Give three possible outcomes and their associated X values.

5. If the sample space 𝒮 is an infinite set, does this necessarily imply that any rv X defined from 𝒮
will have an infinite set of possible values? If yes, say why. If no, give an example.

6. Starting at a fixed time, each car entering an intersection is observed to see whether it
turns left (L), right (R), or goes straight ahead (A). The experiment terminates as
soon as a car is observed to turn left. Let X = the number of cars observed. What are
possible X values? List five outcomes and their associated X values.

7. For each random variable defined here, describe the set of possible values for the
variable, and state whether the variable is discrete.

a. X = the number of unbroken eggs in a randomly chosen standard egg carton

b. Y = the number of students on a class list for a particular course who are
absent on the first day of classes

c. U = the number of times a duffer has to swing at a golf ball before hitting it

d. X = the length of a randomly selected rattlesnake

e. Z = the amount of royalties earned from the sale of a first edition of
10,000 textbooks

f. Y = the pH of a randomly chosen soil sample

g. X = the tension (psi) at which a randomly selected tennis racket has been strung

h. X = the total number of coin tosses required for three individuals to obtain
a match (HHH or TTT)

8. Each time a component is tested, the trial is a success (S) or failure (F). Suppose the
component is tested repeatedly until a success occurs on three consecutive trials.
Let Y denote the number of trials necessary to achieve this. List all outcomes
corresponding to the five smallest possible values of Y, and state which Y value is
associated with each one.

9. An individual named Claudius is located at the point 0 in the accompanying diagram.

[Diagram for Exercise 9 not reproduced]

Using an appropriate randomization device (such as a tetrahedral die, one having four
sides), Claudius first moves to one of the four locations B_1, B_2, B_3, B_4. Once at one of
these locations, he uses another randomization device to decide whether he next
returns to 0 or next visits one of the other two adjacent points. This process then
continues; after each move, another move to one of the (new) adjacent points is
determined by tossing an appropriate die or coin.

a. Let X = the number of moves that Claudius makes before first returning
to 0. What are possible values of X? Is X discrete or continuous?

b. If moves are allowed also along the diagonal paths connecting 0 to A_1, A_2,
A_3, and A_4, respectively, answer the questions in part (a).

10. The number of pumps in use at both a six-pump station and a four-pump station will
be determined. Give the possible values for each of the following random variables:

a. T = the total number of pumps in use

b. X = the difference between the numbers in use at stations 1 and 2

c. U = the maximum number of pumps in use at either station

d. Z = the number of stations having exactly two pumps in use

3.2 Probability Distributions for Discrete Random Variables


When probabilities are assigned to various outcomes in 𝒮, these in turn determine probabilities
associated with the values of any particular rv X. The probability distribution of X says how the total 
probability of 1 is distributed among (allocated to) the various possible X values. 


Example 3.7 Six batches of components are ready to be shipped by a supplier. The number of 
defective components in each batch is as follows: 


Batch                  1   2   3   4   5   6
Number of defectives   0   2   0   1   2   0


One of these batches is to be randomly selected for shipment to a customer. Let X be the number of 
defectives in the selected batch. The three possible X values are 0, 1, and 2. Of the six equally likely 
simple events, three result in X = 0, one in X = 1, and the other two in X = 2. Let p(0) denote the 
probability that X = 0 and p(1) and p(2) represent the probabilities of the other two possible values of 
X. Then 
$$
\begin{aligned}
p(0) &= P(X = 0) = P(\text{lot 1 or 3 or 6 is sent}) = \tfrac{3}{6} = .500 \\
p(1) &= P(X = 1) = P(\text{lot 4 is sent}) = \tfrac{1}{6} = .167 \\
p(2) &= P(X = 2) = P(\text{lot 2 or 5 is sent}) = \tfrac{2}{6} = .333
\end{aligned}
$$


That is, a probability of .500 is distributed to the X value 0, a probability of .167 is placed on the 
X value 1, and the remaining probability, .333, is associated with the X value 2. The values of X along 
with their probabilities collectively specify the probability distribution or probability mass function of 
X. If this experiment were repeated over and over again, in the long run X = 0 would occur one-half 
of the time, X = 1 one-sixth of the time, and X = 2 one-third of the time. ■
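
The counting argument of Example 3.7 can be checked mechanically. The sketch below is one possible way to do it in Python, assuming the six batches are equally likely to be selected (as the example states); the helper name `pmf_from_equally_likely` is introduced here for illustration and is not part of the text.

```python
from collections import Counter
from fractions import Fraction

# Number of defectives in batches 1..6, as listed in Example 3.7.
defectives_by_batch = {1: 0, 2: 2, 3: 0, 4: 1, 5: 2, 6: 0}

def pmf_from_equally_likely(values):
    """Return {x: P(X = x)} when every listed outcome is equally likely."""
    counts = Counter(values)
    n = len(values)
    return {x: Fraction(c, n) for x, c in sorted(counts.items())}

pmf = pmf_from_equally_likely(list(defectives_by_batch.values()))

for x, p in pmf.items():
    # Exact fraction alongside the three-decimal value used in the text.
    print(x, p, round(float(p), 3))
```

Exact fractions are used so that the probabilities add to exactly 1; the rounded values .500, .167, and .333 are only for display.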


DEFINITION The probability distribution or probability mass function (pmf) of a discrete 
rv is defined for every number x by 
p(x) = P(X = x) = P(all s ∈ 𝒮: X(s) = x)²
The support of p(x) consists of all x values for which p(x) > 0. We will display 
a pmf for the values in its support, and it is always understood that p(x) = 0 
otherwise (i.e., for all other x values). 


In words, for every possible value x of the random variable, the pmf specifies the probability of 
observing that value when the experiment is performed. The conditions p(x) ≥ 0 and Σ p(x) = 1,
where the summation is over all possible x, are required of any pmf. 
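
The two conditions required of any pmf are easy to test numerically. The following is a minimal Python sketch, assuming the pmf is stored as a dictionary mapping each listed x value to p(x); the function names `is_valid_pmf` and `support` and the tolerance `tol` are choices made for this illustration.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Check p(x) >= 0 for every listed x and that the p(x) sum to 1.

    pmf is a dict {x: p(x)}; values not listed are understood to have p(x) = 0.
    A small tolerance absorbs floating-point rounding in the sum.
    """
    nonnegative = all(p >= 0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one

def support(pmf):
    """All x values for which p(x) > 0."""
    return sorted(x for x, p in pmf.items() if p > 0)

print(is_valid_pmf({0: 0.5, 1: 1 / 6, 2: 1 / 3}))  # pmf of Example 3.7
print(is_valid_pmf({0: 0.5, 1: 0.6}))              # total mass exceeds 1
print(support({0: 0.8, 1: 0.2, 2: 0.0}))
```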


Example 3.8 Consider randomly selecting a student at a large public university, and define a 
Bernoulli rv by X = 1 if the selected student does not qualify for in-state tuition (a success from the
university administration’s point of view) and X = 0 if the student does qualify. If 20% of all students 
do not qualify, the pmf for X is 


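$$
\begin{aligned}
p(0) &= P(X = 0) = P(\text{the selected student does qualify}) = .8 \\
p(1) &= P(X = 1) = P(\text{the selected student does not qualify}) = .2 \\
p(x) &= P(X = x) = 0 \quad \text{for } x \ne 0 \text{ or } 1
\end{aligned}
$$

That is,

$$p(x) = \begin{cases} .8 & \text{if } x = 0 \\ .2 & \text{if } x = 1 \end{cases}$$

Figure 3.2 is a picture of this pmf, called a line graph. ■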


Figure 3.2 The line graph for the pmf in Example 3.8


²P(X = x) is read “the probability that the rv X assumes the value x.” For example, P(X = 2) denotes the probability that
the resulting X value is 2.


Example 3.9 An electronics laboratory has five identical-looking power sources, of which only two
are fully charged. The power sources will be tested one by one until a fully charged one is found. Let
the rv Y = the number of tests necessary to identify a fully charged source. Let A and B represent the
two fully charged power sources and C, D, E the other three. Then the pmf of Y is

$$
\begin{aligned}
p(1) &= P(Y = 1) = P(A \text{ or } B \text{ tested first}) = \tfrac{2}{5} = .4 \\
p(2) &= P(Y = 2) = P(C, D, \text{ or } E \text{ first, and then } A \text{ or } B) \\
     &= P(C, D, \text{ or } E \text{ first}) \cdot P(A \text{ or } B \text{ next} \mid C, D, \text{ or } E \text{ first}) = \tfrac{3}{5} \cdot \tfrac{2}{4} = .3 \\
p(3) &= P(Y = 3) = P(C, D, \text{ or } E \text{ first and second, and then } A \text{ or } B) = \tfrac{3}{5} \cdot \tfrac{2}{4} \cdot \tfrac{2}{3} = .2 \\
p(4) &= P(Y = 4) = P(C, D, \text{ and } E \text{ all done first}) = \tfrac{3}{5} \cdot \tfrac{2}{4} \cdot \tfrac{1}{3} = .1 \\
p(y) &= 0 \quad \text{for } y \ne 1, 2, 3, 4
\end{aligned}
$$
The pmf can be presented compactly in tabular form:

y      1    2    3    4
p(y)   .4   .3   .2   .1

where any y value not listed receives zero probability. This pmf can also be displayed in a line graph 
(Figure 3.3). 


Figure 3.3 The line graph for the pmf in Example 3.9 ■


The name “probability mass function” is suggested by a model used in physics for a system of 
“point masses.” In this model, masses are distributed at various locations x along a one-dimensional 
axis. Our pmf describes how the total probability mass of 1 is distributed at various points along the 
axis of possible values of the random variable (where and how much mass at each x). 
Another useful pictorial representation of a pmf, called a probability histogram, is similar to
histograms discussed in Chapter 1. Above each x in the support of X, construct a rectangle centered at
x. The height of each rectangle is proportional to p(x), and the base is the same for all rectangles. 
When possible values are equally spaced, the base is frequently chosen as the distance between 
successive x values (though it could be smaller). Figure 3.4 shows two probability histograms. 




Figure 3.4 Probability histograms: (a) Example 3.8; (b) Example 3.9 


A Parameter of a Probability Distribution 

In Example 3.8, we had p(0) = .8 and p(1) = .2 because 20% of all students did not qualify for in-state
tuition. At another university, it may be the case that p(0) = .9 and p(1) = .1. More generally, the
pmf of any Bernoulli rv can be expressed in the form p(1) = α and p(0) = 1 − α, where 0 < α < 1.
Because the pmf depends on the particular value of α, we often write p(x; α) rather than just p(x):

$$p(x; \alpha) = \begin{cases} 1 - \alpha & \text{if } x = 0 \\ \alpha & \text{if } x = 1 \end{cases} \tag{3.1}$$

Then each choice of α in Expression (3.1) yields a different pmf.


DEFINITION Suppose p(x) depends on a quantity that can be assigned any one of a number 
of possible values, with each different value determining a different probability 
distribution. Such a quantity is called a parameter of the distribution. The 
collection of all probability distributions for different values of the parameter is 
called a family of probability distributions. 


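The quantity α in Expression (3.1) is a parameter. Each different number α between 0 and 1 determines
a different member of a family of distributions; two such members are

$$p(x; .6) = \begin{cases} .4 & \text{if } x = 0 \\ .6 & \text{if } x = 1 \end{cases} \qquad \text{and} \qquad p(x; .5) = \begin{cases} .5 & \text{if } x = 0 \\ .5 & \text{if } x = 1 \end{cases}$$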


Every probability distribution for a Bernoulli rv has the form of Expression (3.1), so it is called the 
family of Bernoulli distributions. 
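
A parameterized family translates naturally into a function with the parameter as an extra argument. The following Python sketch of Expression (3.1) is illustrative; the function name `bernoulli_pmf` is chosen here, and the restriction 0 < α < 1 follows the text.

```python
def bernoulli_pmf(x, alpha):
    """p(x; alpha) from Expression (3.1): alpha at x = 1, 1 - alpha at x = 0."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if x == 1:
        return alpha
    if x == 0:
        return 1 - alpha
    return 0.0  # every other x lies outside the support

# Each value of the parameter picks out one member of the family.
for alpha in (0.2, 0.5, 0.6):
    print(alpha, bernoulli_pmf(0, alpha), bernoulli_pmf(1, alpha))
```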


Example 3.10 In many communication systems, a receiver will send a short signal back to the 
transmitter to indicate whether a message has been received correctly or with errors. These signals are 
often called an acknowledgement (A) and a nonacknowledgement (N), respectively. (Bit sum checks 
and other tools are used by the receiver to determine the absence or presence of errors.) Let p = P(A), 
assume that successive transmission attempts are independent, and define a rv X = number of 
attempts required to successfully transmit one message. Then 
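$$
\begin{aligned}
p(1) &= P(X = 1) = P(A) = p \\
p(2) &= P(X = 2) = P(NA) = P(N) \cdot P(A) = (1 - p)p
\end{aligned}
$$

and

$$p(3) = P(X = 3) = P(NNA) = P(N) \cdot P(N) \cdot P(A) = (1 - p)^2 p$$

Continuing this way, a general formula emerges:

$$p(x) = (1 - p)^{x-1} p \qquad x = 1, 2, 3, \ldots \tag{3.2}$$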


The parameter p can assume any value between 0 and 1. Expression (3.2) describes the family of 
geometric distributions. In most modern communication systems, p is very close to 1, but in a noisy 
system (such as on a WIFI network with lots of interference and/or intervening walls), p could be 
considerably lower. ■
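
Expression (3.2) can be compared with a simulation of the transmission process. The Python sketch below is only an illustration under the example's assumptions (independent attempts, each acknowledged with probability p); the value p = 0.15, the seed, the number of simulated messages, and the helper names are all choices made here.

```python
import random

def geometric_pmf(x, p):
    """p(x) = (1 - p)**(x - 1) * p for x = 1, 2, 3, ...; 0 otherwise."""
    if isinstance(x, int) and x >= 1:
        return (1 - p) ** (x - 1) * p
    return 0.0

def attempts_until_ack(p, rng):
    """Simulate independent attempts until the first acknowledgement (A)."""
    attempts = 1
    while rng.random() >= p:  # this attempt produced N (probability 1 - p)
        attempts += 1
    return attempts

rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.15                # a noisy channel, chosen for illustration
n = 50_000
draws = [attempts_until_ack(p, rng) for _ in range(n)]

for x in (1, 2, 3):
    # Formula value next to the observed relative frequency.
    print(x, round(geometric_pmf(x, p), 4), round(draws.count(x) / n, 4))
```

The relative frequencies should settle near the formula values as the number of simulated messages grows, in line with the long-run interpretation of a pmf.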


The Cumulative Distribution Function 
For some fixed value x, we often wish to compute the probability that the observed value of X will be 
at most x. For example, the pmf in Example 3.7 was 


$$p(x) = \begin{cases} .500 & x = 0 \\ .167 & x = 1 \\ .333 & x = 2 \end{cases}$$

The probability that X is at most 1 is then
P(X ≤ 1) = p(0) + p(1) = .500 + .167 = .667


In this example, X ≤ 1.5 iff X ≤ 1, so P(X ≤ 1.5) = P(X ≤ 1) = .667. Similarly, P(X ≤ 0) =
P(X = 0) = .5, and P(X ≤ .75) = .5 also. Since 0 is the smallest possible value of X, P(X ≤ −1.7) =
0, P(X ≤ −.0001) = 0, and so on. The largest possible X value is 2, so P(X ≤ 2) = 1. And if x is any
number larger than 2, P(X ≤ x) = 1; that is, P(X ≤ 5) = 1, P(X ≤ 10.23) = 1, and so on.

Critically, notice that P(X < 1) = P(X = 0) = .5 ≠ P(X ≤ 1), since the latter probability includes
the probability mass at the x value 1, while P(X < 1) does not. When X is a discrete random variable
and x is in the support of X, P(X < x) < P(X ≤ x).


DEFINITION The cumulative distribution function (cdf) F(x) of a discrete random variable
X with pmf p(x) is defined for every number x by

$$F(x) = P(X \le x) = \sum_{y:\, y \le x} p(y) \tag{3.3}$$


For any number x, F(x) is the probability that the observed value of X 
will be at most x. 


Example 3.11 An online retailer sells flash drives with either 16, 32, 64, 128, or 256 GB of memory. 
The accompanying table gives the distribution of Y = the amount of memory in a purchased drive: 
y      16    32    64    128   256
p(y)   .05   .10   .35   .40   .10


F(16) = P(Y ≤ 16) = P(Y = 16) = p(16) = .05

F(32) = P(Y ≤ 32) = P(Y = 16 or 32) = p(16) + p(32) = .15

F(64) = P(Y ≤ 64) = P(Y = 16 or 32 or 64) = p(16) + p(32) + p(64) = .50

F(128) = P(Y ≤ 128) = p(16) + p(32) + p(64) + p(128) = .90

F(256) = P(Y ≤ 256) = 1


Now for any other number y, F(y) will equal the value of F at the closest possible value of Y to the left
of y. For example,


F(48.7) = P(Y ≤ 48.7) = P(Y ≤ 32) = F(32) = .15
F(127.999) = P(Y ≤ 127.999) = P(Y ≤ 64) = F(64) = .50


If y is less than 16, F(y) = 0 [e.g., F(8) = 0], and if y is at least 256, F(y) = 1 [e.g., F(512) = 1]. The
cdf is thus

$$F(y) = \begin{cases} 0 & y < 16 \\ .05 & 16 \le y < 32 \\ .15 & 32 \le y < 64 \\ .50 & 64 \le y < 128 \\ .90 & 128 \le y < 256 \\ 1 & 256 \le y \end{cases}$$

A graph of this cdf is shown in Figure 3.5.

Figure 3.5 A graph of the cdf of Example 3.11 ■
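
The rule "use the value of F at the closest possible value to the left" is exactly a sorted lookup. A minimal Python sketch for the flash-drive distribution of Example 3.11 follows; it uses the standard-library `bisect` module, and the names `values`, `probs`, `cum`, and `F` are chosen here for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# Possible values of Y (GB) and their probabilities from Example 3.11.
values = [16, 32, 64, 128, 256]
probs = [0.05, 0.10, 0.35, 0.40, 0.10]
cum = list(accumulate(probs))  # running totals F(16), F(32), ..., F(256)

def F(y):
    """cdf of Y: total probability of all possible values that are <= y."""
    k = bisect_right(values, y)  # how many possible values are <= y
    return 0.0 if k == 0 else cum[k - 1]

for y in (8, 16, 48.7, 127.999, 128, 512):
    print(y, round(F(y), 2))
```

Between possible values the function returns the same running total, which is why the graph of F is a step function.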
For a discrete rv X, the graph of F(x) will have a jump at every possible value of X and will be flat
between possible values. Such a graph is called a step function. 


Example 3.12 In Example 3.10, any positive integer was a possible X value, and the pmf was 


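$$p(x) = (1 - p)^{x-1} p \qquad x = 1, 2, 3, \ldots$$

For any positive integer x,

$$F(x) = \sum_{y \le x} p(y) = \sum_{y=1}^{x} (1 - p)^{y-1} p = p \sum_{y=0}^{x-1} (1 - p)^{y} \tag{3.4}$$

To evaluate this sum, we use the fact that the partial sum of a geometric series is

$$\sum_{y=0}^{k} a^{y} = \frac{1 - a^{k+1}}{1 - a}$$

Using this in Equation (3.4), with a = 1 − p and k = x − 1, gives

$$F(x) = p \cdot \frac{1 - (1 - p)^{x}}{1 - (1 - p)} = 1 - (1 - p)^{x} \qquad x \text{ a positive integer}$$

Since F is constant in between positive integers,

$$F(x) = \begin{cases} 0 & x < 1 \\ 1 - (1 - p)^{[x]} & x \ge 1 \end{cases} \tag{3.5}$$

where [x] is the largest integer ≤ x (e.g., [2.7] = 2).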

In an extremely noisy channel with p = .15, the probability of having to transmit a message at most
5 times to get an acknowledgement is F(5) = 1 − (1 − .15)^5 = .5563, whereas F(50) = 1 − (.85)^50 ≈ .9997. This
cdf is graphed in Figure 3.6.


Figure 3.6 A graph of F(x) for Example 3.12 ■
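
The closed form in Expression (3.5) can be checked against direct summation of the geometric pmf. The following Python sketch is illustrative only; the function names are introduced here, and `math.floor` plays the role of [x].

```python
import math

def geometric_cdf(x, p):
    """Closed form (3.5): 0 for x < 1, otherwise 1 - (1 - p)**[x]."""
    if x < 1:
        return 0.0
    return 1 - (1 - p) ** math.floor(x)

def geometric_cdf_by_summing(x, p):
    """Same quantity obtained by adding p(y) = (1 - p)**(y - 1) * p over y <= x."""
    return sum((1 - p) ** (y - 1) * p for y in range(1, math.floor(x) + 1))

p = 0.15
print(round(geometric_cdf(5, p), 4))             # F(5) for the noisy channel
print(round(geometric_cdf(50, p), 4))            # F(50)
print(geometric_cdf(2.7, p) == geometric_cdf(2, p))  # F is flat between integers
```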
In our examples thus far, the cdf has been derived from the pmf. This process can be reversed to
obtain the pmf from the cdf whenever the latter function is available. Suppose, for example, that 
X represents the number of defective components in a shipment consisting of six components, so that 
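possible X values are 0, 1, ..., 6. Then

$$p(3) = P(X = 3) = [p(0) + p(1) + p(2) + p(3)] - [p(0) + p(1) + p(2)] = P(X \le 3) - P(X \le 2) = F(3) - F(2)$$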


More generally, the probability that X falls in a specified interval is easily obtained from the cdf. For 
example, 


P(2 ≤ X ≤ 4) = p(2) + p(3) + p(4)
= [p(0) + ⋯ + p(4)] − [p(0) + p(1)]
= P(X ≤ 4) − P(X ≤ 1)
= F(4) − F(1)


Notice that P(2 ≤ X ≤ 4) ≠ F(4) − F(2). This is because the X value 2 is included in 2 ≤ X ≤ 4,
so we do not want to subtract out its probability. However, P(2 < X ≤ 4) = F(4) − F(2) because
X = 2 is not included in the interval 2 < X ≤ 4.


PROPOSITION For any two numbers a and b with a ≤ b,
P(a ≤ X ≤ b) = F(b) − F(a−)


where F(a−) represents the limit of F(x) as x approaches a from the left. In
particular, if the only possible values are integers and if a and b are
integers, then


P(a ≤ X ≤ b) = P(X = a or a + 1 or ... or b)
= F(b) − F(a − 1)


Taking a = b yields P(X = a) = F(a) − F(a − 1) in this case.


The reason for subtracting F(a−) rather than F(a) is that we want to include the probability mass at
X = a; F(b) − F(a) gives P(a < X ≤ b). This proposition will be used extensively when computing
binomial and Poisson probabilities in Sections 3.5 and 3.6. 
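
The integer-valued case of the proposition can be illustrated with a short Python sketch. It reuses the pmf of Y from Example 3.9 (possible values 1, 2, 3, 4) purely as a convenient integer-valued example; the function names `cdf` and `prob_closed_interval` are chosen here and are not from the text.

```python
# pmf of Y from Example 3.9; the possible values are the integers 1 through 4.
pmf = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}

def cdf(pmf, x):
    """F(x) = P(X <= x): add the mass at every possible value y <= x."""
    return sum(p for y, p in pmf.items() if y <= x)

def prob_closed_interval(pmf, a, b):
    """P(a <= X <= b) for an integer-valued rv, computed as F(b) - F(a - 1)."""
    return cdf(pmf, b) - cdf(pmf, a - 1)

# Subtracting F(a - 1) keeps the mass at a; subtracting F(a) would drop it.
print(round(prob_closed_interval(pmf, 2, 3), 10))  # P(2 <= Y <= 3) = p(2) + p(3)
print(round(cdf(pmf, 3) - cdf(pmf, 2), 10))        # P(2 < Y <= 3) = p(3) only
print(round(prob_closed_interval(pmf, 3, 3), 10))  # P(Y = 3) = F(3) - F(2)
```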

Example 3.13 Let X = the number of days of sick leave taken by a randomly selected employee of a 
large company during a particular year. If the maximum number of allowable sick days per year is 14, 
possible values of X are 0, 1, ..., 14. With F(0) = .58, F(1) = .72, F(2) = .76, F(3) = .81, F(4) = .88,
and F(5) = .94,


P(2 ≤ X ≤ 5) = P(X = 2, 3, 4, or 5) = F(5) − F(1) = .22




and

P(X = 3) = F(3) − F(2) = .81 − .76 = .05 ■


Another View of Probability Mass Functions 
It is often helpful to think of a pmf as specifying a mathematical model for a discrete population. 


Example 3.14 Consider selecting at random a student who is among the 15,000 registered for the 
current term at a certain university. Let X = the number of courses for which the selected student is 
registered, and suppose that X has the following pmf: 


x      1     2     3     4     5     6     7
p(x)   .01   .03   .13   .25   .39   .17   .02


One way to view this situation is to think of the population as consisting of 15,000 individuals,
each having his or her own X value; the proportion with each X value is given by p(x). An alternative
viewpoint is to forget about the students and think of the population itself as consisting of the
X values: There are some 1’s in the population, some 2’s, ..., and finally some 7’s. The population
then consists of the numbers 1, 2, ..., 7 (so is discrete), and p(x) gives a model for the distribution of
population values. ■
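
The two viewpoints in Example 3.14 can be made concrete by building the population of X values explicitly. The Python sketch below is an illustration only: it assumes the tabulated proportions hold exactly for the 15,000 students, so each count is 15,000 · p(x); the variable names are chosen here.

```python
# pmf of X = number of courses, as tabulated in Example 3.14.
pmf = {1: 0.01, 2: 0.03, 3: 0.13, 4: 0.25, 5: 0.39, 6: 0.17, 7: 0.02}
N = 15_000  # number of registered students

# ASSUMPTION: the proportions are exact, so the number of students with
# X = x is N * p(x), rounded to absorb floating-point error.
counts = {x: round(N * p) for x, p in pmf.items()}

# The population viewed as a list of X values: some 1's, some 2's, ..., some 7's.
population = [x for x, c in counts.items() for _ in range(c)]

print(counts)
print(len(population))
# The proportion of each value in the population recovers p(x).
print(population.count(5) / N)
```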


Once we have such a population model, we will use it to compute values of population characteristics
(e.g., the mean μ) and make inferences about such characteristics.


Exercises: Section 3.2 (11-27) 


11. Let X be the number of students who show up at a professor’s office hours on a particular
day. Suppose that the only possible values of X are 0, 1, 2, 3, and 4, and that
p(0) = .30, p(1) = .25, p(2) = .20, and p(3) = .15.

a. What is p(4)?

b. Draw both a line graph and a probability histogram for the pmf of X.

c. What is the probability that at least two students come to the office hour? What
is the probability that more than two students come to the office hour?

d. What is the probability that the professor shows up for his office hour?

12. Airlines sometimes overbook flights. Suppose that for a plane with 50 seats, 55
passengers have tickets. Define the random variable Y as the number of ticketed
passengers who actually show up for the flight. The probability mass function of Y appears
in the accompanying table.

y      45    46    47    48    49    50    51    52    53    54    55
p(y)   .05   .10   .12   .14   .25   .17   .06   .05   .03   .02   .01

a. What is the probability that the flight will accommodate all ticketed passengers
who show up?

b. What is the probability that not all ticketed passengers who show up can
be accommodated?

c. If you are the first person on the standby list (which means you will be the first
one to get on the plane if there are any seats available after all ticketed passengers
have been accommodated), what is the probability that you will be able to take the
flight? What is this probability if you are the third person on the standby list?
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13. 


14. 


15. 
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A mail-order computer business has six 
telephone lines. Let X denote the number of 
lines in use at a specified time. Suppose the 
pmf of X is as given in the accompanying 
table. 


10 15 20 25 20 06 = .04 


Calculate the probability of each of the 
following events. 


{at most three lines are in use} 

{fewer than three lines are in use} 

{at least three lines are in use} 
{between two and five lines, inclusive, 

are in use} 
e. {between two and four lines, inclusive, 
are not in use} 
f. {at least four lines are not in use} 


ao op 


A contractor is required by a county plan- 
ning department to submit one, two, three, 
four, or five forms (depending on the nature 
of the project) in applying for a building 
permit. Let Y=the number of forms 
required of the next applicant. The proba- 
bility that y forms are required is known to 
be proportional to y—that is, p(y) = ky for 
yee 1 ds 5. 

a. What is the value of k? 


YP) = 1 

b. What is the probability that at most 
three forms are required? 

c. What is the probability that between 
two and four forms (inclusive) are 
required? 

d. Could p(y) = y7/50 for y = 1, ..., 5 be 
the pmf of Y? 


Many manufacturers have quality control 
programs that include inspection of 
incoming materials for defects. Suppose a 
computer manufacturer receives computer 
boards in lots of five. Two boards are 
selected from each lot for inspection. We 
can represent possible outcomes of the 
selection process by pairs. For example, the 
pair (1, 2) represents the selection of boards 
1 and 2 for inspection. 


(Hint: 


16. 


17. 


a. List the different 
outcomes. 

b. Suppose that boards | and 2 are the 
only defective boards in a lot of five. 
Two boards are to be chosen at ran- 
dom. Define X to be the number of 
defective boards observed among those 
inspected. Determine the probability 
distribution of X. 

c. Let F(x) denote the cdf of X. First 
determine F(0) = P(X < 0), F(1), and 
FQ), and then obtain F(x) for all 
other x. 


ten possible 


Some parts of California are particularly 
earthquake-prone. Suppose that in one such 
area, 30% of all homeowners are insured 
against earthquake damage. Four home- 
owners are to be selected at random; let 
X denote the number among the four who 
have earthquake insurance. 


a. Find the probability distribution of X. 
[Hint: Let S denote a homeowner who 
has insurance and F one who does not. 
One possible outcome is SFSS, with 
probability (.3)(.7)(.3)(.3) and associ- 
ated X value 3. There are 15 other 
outcomes. | 

b. Draw the corresponding probability 

histogram. 

What is the most likely value for X? 

d. What is the probability that at least two 
of the four selected have earthquake 
insurance? 


© 


A new battery’s voltage may be acceptable 
(A) or unacceptable (U). A certain flashlight 
requires two batteries, so batteries will be 
independently selected and tested until two 
acceptable ones have been found. Suppose 
that 90% of all batteries have acceptable 
voltages. Let Y denote the number of bat- 
teries that must be tested. 
a. What is p(2), that is, P(Y = 2)? 
b. What is p(3)? [Hint: There are two 
different outcomes that result in Y = 3.] 
c. To have Y=5, what must be true of 
the fifth battery selected? List the four 
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18. 


19. 


outcomes for which Y=5 and then 
determine p(5). 

d. Use the pattern in your answers for 
parts (a)-(c) to obtain a general for- 
mula for p(y). 


Two fair six-sided dice are tossed inde- 

pendently. Let M =the maximum of the 

two tosses [thus M(1, 5) = 5, M(3, 3) = 3, 

etc. ]. 

a. What is the pmf of M? [Hint: First 
determine p(1), then p(2), and so on.] 

b. Determine the cdf of M and graph it. 


Suppose that you read through this year’s 
issues of the New York Times and record each 
number that appears in a news article—the 
income of a CEO, the number of cases of 
wine produced by a winery, the total char- 
itable contribution of a politician during the 
previous tax year, the age of a celebrity, 
and so on. Now focus on the leading digit 
of each number, which could be 1, 2, ..., 8, 
or 9. Your first thought might be that the 
leading digit X of a randomly selected 
number would be equally likely to be one 
of the nine possibilities (a discrete uniform 
distribution). However, much empirical 
evidence as well as some theoretical argu- 
ments suggest an alternative probability 
distribution called Benford’s law: 

\[ p(x) = P(\text{1st digit is } x) = \log_{10}\left(\frac{x+1}{x}\right), \quad x = 1, 2, \ldots, 9 \]


a. Without computing individual proba- 
bilities from this formula, show that it 
specifies a legitimate pmf. 

b. Now compute the individual probabil- 

ities and compare to the corresponding 

discrete uniform distribution. 

c. Obtain the cdf of X. 

d. Using the cdf, what is the probability 
that the leading digit is at most 3? At 
least 5? 

[Note: Benford’s law is the basis for some 
auditing procedures used to detect fraud in 
financial reporting, for example by the 
Internal Revenue Service.] 


20. A library subscribes to two different weekly 
news magazines, each of which is supposed 
to arrive in Wednesday’s mail. In actuality, 
each one may arrive on Wednesday, 
Thursday, Friday, or Saturday. Suppose the 
two arrive independently of one another, 
and for each one P(W) = .3, P(Th) = .4, 
P(F) = .2, and P(S) = .1. Let Y = the number 
of days beyond Wednesday that it takes 
for both magazines to arrive (so possible 
Y values are 0, 1, 2, or 3). Compute the pmf of 
Y. [Hint: There are 16 possible outcomes; 
Y(W, W) = 0, Y(F, Th) = 2, and so on.] 


21. Refer to Exercise 13, and calculate and 
graph the cdf F(x). Then use it to calculate 
the probabilities of the events given in parts 
(a)-(d) of that problem. 


22. Let X denote the number of vehicles queued 
up at a bank’s drive-up window at a particular 
time of day. The cdf of X is as follows: 

\[
F(x) = \begin{cases}
0 & x < 0 \\
.06 & 0 \le x < 1 \\
.19 & 1 \le x < 2 \\
.39 & 2 \le x < 3 \\
.67 & 3 \le x < 4 \\
.92 & 4 \le x < 5 \\
.97 & 5 \le x < 6 \\
1 & 6 \le x
\end{cases}
\]

Calculate the following probabilities 
directly from the cdf: 

a. p(2), that is, P(X = 2) 
b. P(X > 3) 
c. P(2 ≤ X ≤ 5) 
d. P(2 < X < 5) 


23. An insurance company offers its policyholders 
a number of different premium 
payment options. For a randomly selected 
policyholder, let X = the number of months 
between successive payments. The cdf of 
X is as follows: 

\[
F(x) = \begin{cases}
0 & x < 1 \\
.30 & 1 \le x < 3 \\
.40 & 3 \le x < 4 \\
.45 & 4 \le x < 6 \\
.60 & 6 \le x < 12 \\
1 & 12 \le x
\end{cases}
\]

a. What is the pmf of X? 
b. Using just the cdf, compute 
P(3 ≤ X ≤ 6) and P(4 ≤ X). 


24. In Example 3.10, let Y = the number of 
nonacknowledgements before the experiment 
terminates. With p = P(A) and 
1 − p = P(N), what is the pmf of Y? [Hint: 
First list the possible values of Y, starting 
with the smallest, and proceed until you see 
a general formula.] 


25. Alvie Singer lives at 0 in the accompanying 
diagram and has four friends who live at A, 
B, C, and D. One day Alvie decides to go 
visiting, so he tosses a fair coin twice to 
decide which of the four to visit. Once at a 
friend’s house, he will either return home or 
else proceed to one of the two adjacent 
houses (such as 0, A, or C when at B), with 
each of the three possibilities having probability 
1/3. In this way, Alvie continues to 
visit friends until he returns home. 

[Diagram: home 0 joined by line segments to houses A, B, C, and D, with adjacent houses also joined] 

a. Let X = the number of times that Alvie 
visits a friend. Derive the pmf of X. 

b. Let Y = the number of straight-line 
segments that Alvie traverses (including 
those leading to and from 0). What 
is the pmf of Y? 

c. Suppose that female friends live at 
A and C and male friends at B and D. If 
Z = the number of visits to female 
friends, what is the pmf of Z? 


26. After all students have left the classroom, a 
statistics professor notices that four copies 
of the text were left under desks. At the 
beginning of the next lecture, the professor 
distributes the four books in a completely 
random fashion to each of the four students 
(1, 2, 3, and 4) who claim to have left 
books. One possible outcome is that 1 
receives 2’s book, 2 receives 4’s book, 3 
receives his or her own book, and 4 
receives 1’s book. This outcome can be 
abbreviated as (2, 4, 3, 1). 
a. List the other 23 possible outcomes. 
b. Let X denote the number of students 
who receive their own book. Determine 
the pmf of X. 


27. Show that the cdf F(x) is a nondecreasing 
function; that is, x₁ < x₂ implies that 
F(x₁) ≤ F(x₂). Under what condition will 
F(x₁) = F(x₂)? 


3.3 Expected Values of Discrete Random Variables 


In Example 3.14, we considered a university with 15,000 students and let X = the number of courses 
for which a randomly selected student is registered. The pmf of X follows. Since p(1) = .01, we know 
that (.01) · (15,000) = 150 of the students are registered for one course, and similarly for the other 
x values. 

x                    1     2     3      4      5      6      7
p(x)                .01   .03   .13    .25    .39    .17    .02        (3.6)
Number registered   150   450   1950   3750   5850   2550   300


To compute the average number of courses per student, i.e., the average value of X in the 
population, we should calculate the total number of courses and divide by the total number of 
students. Since each of 150 students is taking one course, these 150 contribute 150 courses to the 
total. Similarly, 450 students contribute 2(450) courses, and so on. The population average value of 
X is then 


\[ \frac{1(150) + 2(450) + 3(1950) + \cdots + 7(300)}{15{,}000} = 4.57 \qquad (3.7) \]


Since 150/15,000 = .01 = p(1), 450/15,000 = .03 = p(2), and so on, an alternative expression for 
(3.7) is 


\[ 1 \cdot p(1) + 2 \cdot p(2) + \cdots + 7 \cdot p(7) \qquad (3.8) \]


Expression (3.8) shows that to compute the population average value of X, we need only the possible 
values of X along with their probabilities (proportions). In particular, the population size is irrelevant 
as long as the pmf is given by (3.6). The average or mean value of X is then a weighted average of the 
possible values 1, ..., 7, where the weights are the probabilities of those values. 
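The equivalence of (3.7) and (3.8) can be checked numerically. The following Python sketch is only an illustration; the names `pmf`, `mean_from_counts`, and `mean_from_pmf` are assumptions introduced here, not notation from the text. Exact fractions are used so that no rounding error enters.

```python
from fractions import Fraction

# pmf (3.6): number of courses x -> proportion of students p(x)
pmf = {1: Fraction(1, 100), 2: Fraction(3, 100), 3: Fraction(13, 100),
       4: Fraction(25, 100), 5: Fraction(39, 100), 6: Fraction(17, 100),
       7: Fraction(2, 100)}

def mean_from_counts(pmf, population):
    """Expression (3.7): total number of courses divided by total students."""
    total_courses = sum(x * (p * population) for x, p in pmf.items())
    return total_courses / population

def mean_from_pmf(pmf):
    """Expression (3.8): weighted average of the possible values."""
    return sum(x * p for x, p in pmf.items())

print(float(mean_from_counts(pmf, 15000)))  # 4.57
print(float(mean_from_pmf(pmf)))            # 4.57
```

Calling `mean_from_counts` with any other population size gives the same value, which is the sense in which the population size is irrelevant once the pmf is known.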


The Expected Value of X 


DEFINITION Let X be a discrete rv with set of possible values D and pmf p(x). 
The expected value or mean value of X, denoted by E(X) or μ_X or 
just μ, is 

\[ E(X) = \mu_X = \mu = \sum_{x \in D} x \cdot p(x) \]

This expected value will exist provided that \(\sum_{x \in D} |x| \cdot p(x) < \infty\). 


Example 3.15 For the pmf in (3.6), 


\[
\begin{aligned}
\mu &= 1 \cdot p(1) + 2 \cdot p(2) + \cdots + 7 \cdot p(7) \\
    &= (1)(.01) + (2)(.03) + \cdots + (7)(.02) \\
    &= .01 + .06 + .39 + 1.00 + 1.95 + 1.02 + .14 = 4.57
\end{aligned}
\]


If we think of the population as consisting of the X values 1, 2, ..., 7, then μ = 4.57 is the population 
mean (we will often refer to μ as the population mean rather than “the mean of X in the population”). 
Notice that μ here is not 4, the ordinary average of 1, ..., 7, because the distribution puts more weight 
on 4, 5, and 6 than on other X values. ■ 


In Example 3.15, the expected value μ was 4.57, which is not a possible value of X. The word 
expected should be interpreted with caution because one would not expect to see an X value of 4.57 
when a single student is selected. 

Example 3.16 Just after birth, each newborn child is rated on a scale called the Apgar scale. The 
possible ratings are 0, 1, ..., 10, with the child’s rating determined by color, muscle tone, respiratory 
effort, heartbeat, and reflex irritability (the best possible score is 10). Let X be the Apgar score of a 
randomly selected child born at a certain hospital during the next year, and suppose that the pmf of X is 


x      0      1      2      3      4     5     6     7     8     9     10
p(x)  .002   .001   .002   .005   .02   .04   .18   .37   .25   .12   .01


Then the mean value of X is 


\[ E(X) = \mu = (0)(.002) + (1)(.001) + (2)(.002) + \cdots + (8)(.25) + (9)(.12) + (10)(.01) = 7.15 \]


(Again, μ is not a possible value of the variable X.) If the stated model is correct, then the mean Apgar 
score for the population of all children born at this hospital next year will be 7.15. ■ 


Example 3.17 Let X = 1 if a randomly selected component needs warranty service and X = 0 
otherwise. If the chance a component needs warranty service is p, then X is a Bernoulli rv with pmf 
p(1) = p and p(0) = 1 − p, from which 


\[ E(X) = 0 \cdot p(0) + 1 \cdot p(1) = 0(1 - p) + 1(p) = p \]


That is, the expected value of X is just the probability that X takes on the value 1. If we conceptualize 
a population consisting of 0’s in proportion 1 − p and 1’s in proportion p, then the population average 
is μ = p. ■ 


There is another frequently used interpretation of μ. Consider observing a first value x of our 
random variable, then observing independently another value, then another, and so on. If after a very 
large number of x values we average them, the resulting sample average will typically be close to μ; a 
more rigorous version of this statement is provided by the Law of Large Numbers in Chapter 6. That 
is, μ can be interpreted as the long-run average value of X when the experiment is performed 
repeatedly. This interpretation is often appropriate for games of chance, where the “population” is not 
a concrete set of individuals but rather the results of all hypothetical future instances of playing the 
game. 


Example 3.18 A standard American roulette wheel has 38 spaces. Players bet on which space a 
marble will land in once the wheel has been spun. One of the simplest bets is based on the color of the 
space: 18 spaces are black, 18 are red, and 2 are green. So, if a player “bets on black,” s/he has an 
18/38 chance of winning. Casinos consider color bets an “even wager,” meaning that a player who 
wagers $1 on black, say, will profit $1 if the marble lands in a black space (and lose the wagered $1 
otherwise). 

Let X = the return on a $1 wager on black. Then the pmf of X is 


x -$1 +$1 
p(x) 20/38 18/38 


and the expected value of X is E(X) = (−1)(20/38) + (1)(18/38) = −2/38 = −$.0526. If a player 
makes $1 bets on black on successive spins of the roulette wheel, in the long run s/he can expect to 
lose about 5.26 cents per wager. Since players don’t necessarily make a large number of wagers, this 
long-run average interpretation is perhaps more apt from the casino’s perspective: in the long run, 
they will gain an average of 5.26 cents for every $1 wagered on black at the roulette table. ■ 
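The long-run average interpretation can be illustrated by simulation. The sketch below is not part of the original example; it assumes a fair wheel whose 38 spaces are numbered 0 through 37, with spaces 0 through 17 arbitrarily designated as the black ones, and the function name `simulate_black_bets` is hypothetical.

```python
import random

def simulate_black_bets(n_bets, seed=0):
    """Average return per $1 bet on black over n_bets simulated spins."""
    rng = random.Random(seed)              # fixed seed so a run can be repeated
    total = 0
    for _ in range(n_bets):
        space = rng.randrange(38)          # 38 equally likely spaces, 0..37
        total += 1 if space < 18 else -1   # assumed: spaces 0..17 are black
    return total / n_bets

expected = (-1) * 20 / 38 + (1) * 18 / 38  # E(X) = -2/38, about -0.0526
for n in (100, 10_000, 200_000):
    # sample averages for increasing n, next to the theoretical mean
    print(n, simulate_black_bets(n), expected)
```

With only 100 bets the sample average can be far from −.0526, while for large n it typically lands close to it, in line with the casino’s-perspective remark above.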


Thus far, we have assumed that the mean of any given distribution exists. If the set of possible 
values of X is unbounded, so that the sum for μ_X is actually an infinite series, the expected value of 
X might or might not exist, depending on whether the series converges or diverges. 


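Example 3.19 From Example 3.10, the general form for the pmf of X = number of attempts 
required to successfully transmit one message is 


\[ p(x) = (1 - p)^{x-1} p, \quad x = 1, 2, 3, \ldots \]


From the definition, 


\[ E(X) = \sum_{D} x \cdot p(x) = \sum_{x=1}^{\infty} x p (1 - p)^{x-1} = p \sum_{x=1}^{\infty} \left[ -\frac{d}{dp} (1 - p)^{x} \right] \qquad (3.9) \]


If we interchange the order of taking the derivative and the summation, the sum is that of a geometric 
series. A little calculus reveals that the final result is E(X) = 1/p. Specifically, the geometric series 
sums to (1 − p)/p, whose derivative with respect to p is −1/p², so E(X) = p · (1/p²) = 1/p. If p is near 1, 
we expect a successful transmission very soon, whereas if p is near 0, we expect many attempts before 
the first success. For p = .5, E(X) = 2. ■ 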
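As a numerical sanity check on E(X) = 1/p, one can truncate the infinite series after many terms. This is only a sketch under the geometric pmf p(x) = (1 − p)^(x−1) p discussed in the example; the helper name `geometric_mean_partial` and the cutoff of 2000 terms are arbitrary choices made here.

```python
def geometric_mean_partial(p, n_terms):
    """Partial sum of E(X) = sum over x >= 1 of x * p * (1 - p)**(x - 1)."""
    return sum(x * p * (1 - p) ** (x - 1) for x in range(1, n_terms + 1))

for p in (0.9, 0.5, 0.1):
    # truncated series next to the closed form 1/p
    print(p, geometric_mean_partial(p, 2000), 1 / p)
```

The terms decay geometrically, so the truncated sum agrees with 1/p to many decimal places even for p as small as .1.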


Example 3.20 Let X, the number of interviews a student has prior to getting a job, have pmf 

\[ p(x) = k/x^2, \quad x = 1, 2, 3, \ldots \]

where k is chosen so that \(\sum_{x=1}^{\infty} (k/x^2) = 1\). [Because \(\sum_{x=1}^{\infty} (1/x^2) = \pi^2/6\), the value of k is 6/π².] 
The expected value of X is 


\[ \mu = E(X) = \sum_{x=1}^{\infty} x \cdot \frac{k}{x^2} = k \sum_{x=1}^{\infty} \frac{1}{x} \qquad (3.10) \]


The sum on the right of Equation (3.10) is the famous harmonic series of mathematics and can be 
shown to equal ∞. E(X) is not finite here because p(x) does not decrease sufficiently fast as x increases; 
statisticians say that the probability distribution of X has “a heavy tail.” If a sequence of 
X values is chosen using this distribution, the sample average will not settle down to some finite 
number but will tend to grow without bound. 

Statisticians use the phrase “heavy tails” in connection with any distribution having a large amount 
of probability far from μ (so heavy tails do not require μ = ∞). Such heavy tails make it difficult to 
make inferences about μ. ■ 
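The divergence in (3.10) is slow, which a few partial sums make visible. The following sketch is illustrative only; `truncated_mean` is a hypothetical helper, and the comparison with k · ln(n) relies on the standard fact that the harmonic partial sums grow like the natural logarithm.

```python
import math

K = 6 / math.pi ** 2   # normalizing constant k from Example 3.20

def truncated_mean(n_terms):
    """Partial sum of (3.10): k * (1/1 + 1/2 + ... + 1/n_terms)."""
    return K * sum(1 / x for x in range(1, n_terms + 1))

for n in (10, 1_000, 100_000):
    # the partial sums keep growing, roughly like K * ln(n) plus a constant
    print(n, round(truncated_mean(n), 3), round(K * math.log(n), 3))
```

Each hundredfold increase in the number of terms adds about the same amount to the partial sum, so the series never levels off at a finite value.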


The Expected Value of a Function 
Often we will be interested in the expected value of some function h(X) rather than X itself. An easy 
way of computing the expected value of h(X) is suggested by the following example. 

Example 3.21 The cost of a certain vehicle diagnostic test depends on the number of cylinders X in 
the vehicle’s engine. Suppose the cost function is h(X) = 20 + 3X + .5X². Since X is a random 
variable, so is Y = h(X). The pmf of X and the derived pmf of Y are as follows: 


x      4    6    8          y      40   56   76
p(x)  .5   .3   .2          p(y)  .5   .3   .2


With D* denoting the possible values of Y, 

\[
\begin{aligned}
E[h(X)] = E(Y) &= \sum_{D^*} y \cdot p(y) = (40)(.5) + (56)(.3) + (76)(.2) = \$52 \\
               &= h(4) \cdot (.5) + h(6) \cdot (.3) + h(8) \cdot (.2) = \sum_{D} h(x) \cdot p(x)
\end{aligned}
\qquad (3.11)
\]


According to Expression (3.11), it was not necessary to determine the pmf of Y to obtain E(Y); 
instead, the desired expected value is a weighted average of the possible h(x) (rather than x) values. 


LAW OF THE UNCONSCIOUS STATISTICIAN 
If the rv X has a set of possible values D and pmf p(x), then the expected 
value of any function h(X), denoted by E[h(X)] or μ_{h(X)}, is computed by 

\[ E[h(X)] = \sum_{x \in D} h(x) \cdot p(x) \]

assuming that \(\sum_{x \in D} |h(x)| \cdot p(x) < \infty\). 


According to this proposition, E[h(X)] is computed in the same way that E(X) itself is, except that 
h(x) is substituted in place of x. That is, E[h(X)] is a weighted average of possible h(X) values, where 
the weights are the probabilities of the corresponding original X values. 


Example 3.22 A computer store has purchased three computers at $500 apiece. It will sell them for 
$1000 apiece. The manufacturer has agreed to repurchase any computers still unsold after a specified 
period at $200 apiece. Let X denote the number of computers sold, and suppose that p(0) = .1, 
p(1) = .2, p(2) = .3, and p(3) = .4. With h(X) denoting the profit associated with selling X units, the 
given information implies that h(X) = revenue − cost = 1000X + 200(3 − X) − 1500 = 
800X − 900. The expected profit is then 


\[
\begin{aligned}
E[h(X)] &= h(0) \cdot p(0) + h(1) \cdot p(1) + h(2) \cdot p(2) + h(3) \cdot p(3) \\
        &= (-900)(.1) + (-100)(.2) + (700)(.3) + (1500)(.4) \\
        &= \$700
\end{aligned}
\]
■ 
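The two routes to an expected value of a function (via the derived pmf of Y = h(X), or directly via the Law of the Unconscious Statistician) can be compared in code. This is a minimal sketch using the numbers of Example 3.22; the helper names `expect` and `derived_pmf` are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# pmf of X = number of computers sold (Example 3.22)
pmf_x = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

def profit(x):
    """h(x) = 800x - 900, the profit function of Example 3.22."""
    return 800 * x - 900

def expect(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x) over the possible values of X."""
    return sum(h(x) * p for x, p in pmf.items())

def derived_pmf(pmf, h):
    """pmf of Y = h(X): pool the probability of every x sharing the same h(x)."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[h(x)] += p
    return dict(out)

print(expect(pmf_x, profit))               # 700, directly from the pmf of X
print(expect(derived_pmf(pmf_x, profit)))  # 700, via the pmf of Y
```

The pooling step in `derived_pmf` matters when h is not one-to-one; the direct route never needs it, which is the practical appeal of the proposition.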


Because an expected value is a sum, it possesses the same properties as any summation; specifically, 
the expected value “operator” can be distributed across addition and across multiplication by constants. 
This important property is known as linearity of expectation. 

LINEARITY OF EXPECTATION 
For any functions h₁(X) and h₂(X) and any constants a₁, a₂, and b, 

\[ E[a_1 h_1(X) + a_2 h_2(X) + b] = a_1 E[h_1(X)] + a_2 E[h_2(X)] + b \]

In particular, for any linear function aX + b, 

\[ E(aX + b) = a \cdot E(X) + b \qquad (3.12) \]

(or, using alternative notation, μ_{aX+b} = a · μ_X + b). 


Proof Let h(X) = a₁h₁(X) + a₂h₂(X) + b, and apply the Law of the Unconscious Statistician: 

\[
\begin{aligned}
E[a_1 h_1(X) + a_2 h_2(X) + b] &= \sum_{D} \left( a_1 h_1(x) + a_2 h_2(x) + b \right) \cdot p(x) \\
&= a_1 \sum_{D} h_1(x) \cdot p(x) + a_2 \sum_{D} h_2(x) \cdot p(x) + b \sum_{D} p(x)
   \quad \text{(distributive property of addition)} \\
&= a_1 E[h_1(X)] + a_2 E[h_2(X)] + b[1] = a_1 E[h_1(X)] + a_2 E[h_2(X)] + b
\end{aligned}
\]

The special case of aX + b is obtained by setting a₁ = a, h₁(X) = X, and a₂ = 0. ■ 


By induction, linearity of expectation applies to any finite number of terms. In Example 3.21, 
straightforward computation gives E(X) = 4(.5) + 6(.3) + 8(.2) = 5.4 and E(X²) = Σ x² · p(x) = 
4²(.5) + 6²(.3) + 8²(.2) = 31.6. Applying linearity of expectation to Y = h(X) = 20 + 3X + .5X², we 
obtain 


\[ \mu_Y = E[20 + 3X + .5X^2] = 20 + 3E(X) + .5E(X^2) = 20 + 3(5.4) + .5(31.6) = \$52, \]


which matches the result of Example 3.21. 
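The calculation above can be mirrored in a few lines of Python. The sketch assumes the pmf of Example 3.21 and uses a hypothetical helper `expect`; it computes E(X) and E(X²) once and then combines them by linearity, comparing against the term-by-term weighted average of h(x).

```python
from fractions import Fraction

# pmf of X = number of cylinders (Example 3.21)
pmf = {4: Fraction(1, 2), 6: Fraction(3, 10), 8: Fraction(1, 5)}

def expect(pmf, h=lambda x: x):
    """E[h(X)] as a weighted average of h(x) values."""
    return sum(h(x) * p for x, p in pmf.items())

e_x = expect(pmf)                                   # E(X)   = 27/5  = 5.4
e_x2 = expect(pmf, lambda x: x ** 2)                # E(X^2) = 158/5 = 31.6
via_linearity = 20 + 3 * e_x + Fraction(1, 2) * e_x2
direct = expect(pmf, lambda x: 20 + 3 * x + Fraction(1, 2) * x ** 2)
print(float(e_x), float(e_x2), via_linearity, direct)
```

Once E(X) and E(X²) are known, the expected value of any other quadratic cost function can be obtained without revisiting the pmf.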

The special case (3.12) states that the expected value of a linear function equals the linear 
function evaluated at the expected value E(X). Because h(X) in Example 3.22 is linear and 
E(X) = 2, E[h(X)] = 800(2) − 900 = $700, as before. Two special cases of (3.12) yield two 
important rules of expected value: 


1. For any constant a, μ_{aX} = a · μ_X (take b = 0). 
2. For any constant b, μ_{X+b} = μ_X + b = E(X) + b (take a = 1). 


Multiplication of X by a constant a changes the unit of measurement (from dollars to cents, where 
a = 100; inches to cm, where a = 2.54; etc.). Rule 1 says that the expected value in the new units 
equals the expected value in the old units multiplied by the conversion factor a. Similarly, if the 
constant b is added to each possible value of X, then the expected value will be shifted by that same 
constant amount. 

One commonly made error is to substitute μ_X directly into the function h(X) when h is a nonlinear 
function, in which case (3.12) does not apply. Consider Example 3.21: the mean of X is 5.4, and it’s 
tempting to infer that the mean of Y = h(X) is simply h(5.4). However, since the function 
h(X) = 20 + 3X + .5X² is not linear in X, this does not yield the correct answer: 

\[ h(5.4) = 20 + 3(5.4) + .5(5.4)^2 = \$50.78 \ne \$52 = \mu_Y \]


In general, μ_{h(X)} does not equal h(μ_X) unless the function h(x) is linear. 
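The discrepancy between h(μ_X) and μ_{h(X)} is easy to reproduce. The short sketch below assumes the pmf and cost function of Example 3.21; the variable names are illustrative only.

```python
from fractions import Fraction

pmf = {4: Fraction(1, 2), 6: Fraction(3, 10), 8: Fraction(1, 5)}

def h(x):
    """Nonlinear cost function h(x) = 20 + 3x + .5x^2 from Example 3.21."""
    return 20 + 3 * x + Fraction(1, 2) * x ** 2

mu_x = sum(x * p for x, p in pmf.items())          # E(X) = 5.4
mean_of_h = sum(h(x) * p for x, p in pmf.items())  # E[h(X)], the correct mean of Y
h_of_mean = h(mu_x)                                # h(E(X)), the tempting shortcut
print(float(mean_of_h), float(h_of_mean))          # 52.0 50.78
```

Replacing `h` with any linear function makes the two printed values agree, as (3.12) guarantees.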


The Variance and Standard Deviation of X 

The expected value of X describes where the probability distribution is centered. Using the physical 
analogy of placing point mass p(x) at the value x on a one-dimensional axis, if the axis were then 
supported by a fulcrum placed at μ, there would be no tendency for the axis to tilt. This is illustrated 
for two different distributions in Figure 3.7. 


Figure 3.7 Two different probability distributions with μ = 4 [panels (a) and (b), each plotting p(x) against x] 


Although both distributions pictured in Figure 3.7 have the same mean/fulcrum μ, the distribution 
of Figure 3.7b has greater spread or variability or dispersion than does that of Figure 3.7a. We will 
use the variance of X to assess the amount of variability in (the distribution of) X, just as s² was used 
in Chapter 1 to measure variability in a sample. 


DEFINITION 
Let X have pmf p(x) and expected value μ. Then the variance of X, 
denoted by V(X) or σ²_X or just σ², is 

\[ V(X) = \sum_{D} \left[ (x - \mu)^2 \cdot p(x) \right] = E[(X - \mu)^2] \]

The standard deviation of X, denoted by SD(X) or σ_X or just σ, is 

\[ \sigma_X = \sqrt{V(X)} \]


The quantity h(X) = (X − μ)² is the squared deviation of X from its mean, and σ² is the expected 
squared deviation, i.e., a weighted average of the squared deviations from μ. Taking the square root 
of the variance to obtain the standard deviation returns us to the original units of the variable; e.g., if 
X is measured in dollars, then both μ and σ also have units of dollars. If most of the probability 
distribution is close to μ, as in Figure 3.7a, then σ will typically be relatively small. However, if there 
are x values far from μ that have large probabilities (as in Figure 3.7b), then σ will be larger. 
Intuitively, the value of σ describes a typical deviation from μ. 


Example 3.23 Consider again the distribution of the Apgar score X of a randomly selected newborn 
described in Example 3.16. The mean value of X was calculated as μ = 7.15, so 

\[ V(X) = \sigma^2 = \sum_{x=0}^{10} (x - 7.15)^2 \cdot p(x) = (0 - 7.15)^2(.002) + \cdots + (10 - 7.15)^2(.01) = 1.5815 \]


The standard deviation of X is σ = √1.5815 = 1.26. ■ 


When the pmf p(x) specifies a mathematical model for the distribution of population values, σ² is 
the population variance, and σ is the population standard deviation. 


Properties of Variance 
An alternative to the defining formula for V(X) reduces the computational burden. 


PROPOSITION V(X) = σ² = E(X²) − μ² 


This equation is referred to as the variance shortcut formula. 


In using this formula, E(X²) is computed first without any subtraction; then μ is computed, squared, 
and subtracted (once) from E(X²). This formula is more efficient because it entails only one subtraction, 
and E(X²) does not require calculating squared deviations from μ. 


Example 3.24 Referring back to the Apgar score scenario of Examples 3.16 and 3.23, 


\[ E(X^2) = \sum_{x=0}^{10} x^2 \cdot p(x) = (0^2)(.002) + (1^2)(.001) + \cdots + (10^2)(.01) = 52.704 \]


Thus, σ² = 52.704 − (7.15)² = 1.5815 as before, and again σ = 1.26. ■ 
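Both routes to the variance can be checked with exact arithmetic. The sketch below assumes the Apgar pmf of Example 3.16, entered in thousandths; the variable names are illustrative and not from the text.

```python
from fractions import Fraction
import math

# Apgar pmf from Example 3.16, probabilities given in thousandths
scores = range(11)
probs = [Fraction(n, 1000) for n in (2, 1, 2, 5, 20, 40, 180, 370, 250, 120, 10)]
pmf = dict(zip(scores, probs))

mu = sum(x * p for x, p in pmf.items())
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())   # E[(X - mu)^2]
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2  # E(X^2) - mu^2

print(float(mu), float(var_definition), float(var_shortcut))  # 7.15 1.5815 1.5815
print(round(math.sqrt(var_definition), 2))                    # 1.26
```

With floating-point numbers the shortcut formula can lose accuracy when E(X²) and μ² are both large and nearly equal; using fractions here sidesteps that issue.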


Proof of the Variance Shortcut Formula Expand (X − μ)² in the definition of V(X), and then 
apply linearity of expectation: 


\[
\begin{aligned}
V(X) = E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
&= E(X^2) - 2\mu E(X) + \mu^2 \quad \text{(by linearity of expectation)} \\
&= E(X^2) - 2\mu \cdot \mu + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 \\
&= E(X^2) - \mu^2
\end{aligned}
\]
■ 


The quantity E(X²) in the variance shortcut formula is called the mean-square value of the random 
variable X. Engineers may be familiar with the root-mean-square, or RMS, which is the square root of 
E(X²). Do not confuse this with the square of the mean of X, i.e., μ²! For example, if X has a mean of 
7.15, the mean-square value of X is not (7.15)², because h(x) = x² is not linear. (In Example 3.24, the 
mean-square value of X is 52.704.) It helps to look at the two formulas side by side: 

\[ E(X^2) = \sum_{D} x^2 \cdot p(x) \quad \text{versus} \quad \mu^2 = \left( \sum_{D} x \cdot p(x) \right)^2 \]


The order of operations is clearly different. In fact, it can be shown using the variance shortcut 
formula that E(X²) ≥ μ² for every random variable, with equality if and only if X is constant. 

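The variance of a function h(X) is the expected value of the squared difference between h(X) and 
its expected value: 


\[ V[h(X)] = \sigma^2_{h(X)} = \sum_{D} \left[ (h(x) - \mu_{h(X)})^2 \cdot p(x) \right] = \sum_{D} h^2(x) \cdot p(x) - \left[ \sum_{D} h(x) \cdot p(x) \right]^2 \]


When h(x) is a linear function, V[h(X)] simplifies considerably (see Exercise 42 for a proof). 


PROPOSITION 

\[ V(aX + b) = \sigma^2_{aX+b} = a^2 \cdot \sigma^2_X \quad \text{and} \quad \sigma_{aX+b} = |a| \cdot \sigma_X \qquad (3.13) \]

In particular, 

\[ \sigma_{aX} = |a| \cdot \sigma_X \quad \text{and} \quad \sigma_{X+b} = \sigma_X \]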


The absolute value is necessary because a might be negative, yet a standard deviation cannot be. 
Usually multiplication by a corresponds to a change in the unit of measurement (e.g., kg to lb or 
dollars to euros); the sd in the new unit is just the original sd multiplied by the conversion factor. On 
the other hand, the addition of the constant b does not affect the variance, which is intuitive, because 
the addition of b changes the location (mean value) but not the spread of values. Together, (3.12) and 
(3.13) comprise the rescaling properties of mean and standard deviation. 


Example 3.25 In the computer sales scenario of Example 3.22, E(X) = 2 and 

\[ E(X^2) = (0^2)(.1) + (1^2)(.2) + (2^2)(.3) + (3^2)(.4) = 5 \]

so V(X) = 5 − (2)² = 1. The profit function Y = h(X) = 800X − 900 is linear, so (3.13) applies with 
a = 800 and b = −900. Hence Y has variance σ²_Y = (800)²(1) = 640,000 and standard deviation 
$800. ■ 
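The rescaling rule (3.13) can be confirmed by building the pmf of Y = aX + b explicitly and computing its variance from the definition. This is a sketch using the numbers of Example 3.25; the helpers `mean` and `variance` are hypothetical names.

```python
from fractions import Fraction
import math

pmf_x = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
a, b = 800, -900   # profit Y = 800X - 900

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """V(X) from the definition: expected squared deviation from the mean."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# pmf of Y = aX + b; since a != 0, distinct x values give distinct y values
pmf_y = {a * x + b: p for x, p in pmf_x.items()}

print(variance(pmf_x), variance(pmf_y), a ** 2 * variance(pmf_x))       # 1 640000 640000
print(math.sqrt(variance(pmf_y)), abs(a) * math.sqrt(variance(pmf_x)))  # 800.0 800.0
```

Changing `b` leaves both printed lines unchanged, while changing the sign of `a` leaves the standard deviation positive, which is why the absolute value appears in (3.13).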


Exercises: Section 3.3 (28-45) 


28. The pmf for X = the number of major 
defects on a randomly selected appliance of 
a certain type is 


x      0     1     2     3     4
p(x)  .08   .15   .45   .27   .05


Compute the following: 

a. E(X) 
b. V(X) directly from the definition 
c. The standard deviation of X 
d. V(X) using the shortcut formula 


29. An individual who has automobile insurance 
from a company is randomly selected. 
Let Y be the number of moving violations 
for which the individual was cited during 
the last 3 years. The pmf of Y is 


y      0     1     2     3
p(y)  .60   .25   .10   .05


a. Compute E(Y). 

b. Suppose an individual with Y violations 
incurs a surcharge of $100Y². 
Calculate the expected amount of the 
surcharge. 


30. Refer to Exercise 12 and calculate V(Y) and 
σ_Y. Then determine the probability that Y is 
within 1 standard deviation of its mean value. 


31. An appliance dealer sells three different 
models of upright freezers having 13.5, 
15.9, and 19.1 cubic feet of storage space, 
respectively. Let X = the amount of storage 
space purchased by the next customer to 
buy a freezer. Suppose that X has pmf 


x      13.5   15.9   19.1
p(x)    .2     .5     .3


a. Compute E(X), E(X²), and V(X). 

b. If the price of a freezer having capacity 
X cubic feet is 17X + 180, what is the 
expected price paid by the next customer 
to buy a freezer? 

c. What is the standard deviation of the 
price 17X + 180 paid by the next 
customer? 

d. Suppose that although the rated 
capacity of a freezer is X, the actual 
capacity is h(X) = X − .01X². What is 
the expected actual capacity of the 
freezer purchased by the next 
customer? 


32. Let X be a Bernoulli rv with pmf as in 
Example 3.17. 

a. Compute E(X²). 

b. Show that V(X) = p(1 − p). 

c. Compute E(X⁷⁹). 


33. Suppose that the number of plants of a 
particular type found in a rectangular region 
(called a quadrat by ecologists) in a certain 
geographic area is a rv X with pmf 


p(x) = c/x³ for x = 1, 2, 3, ... 


Is E(X) finite? Justify your answer (this is 
another distribution that statisticians would 
call heavy-tailed). 


34. A small market orders copies of a certain 
magazine for its magazine rack each week. 
Let X = demand for the magazine, with 
pmf 


x      1      2      3      4      5      6
p(x)  1/15   2/15   3/15   4/15   3/15   2/15


Suppose the store owner actually pays 
$2.00 for each copy of the magazine and 
the price to customers is $4.00. If magazines 
left at the end of the week have no 
salvage value, is it better to order three or 
four copies of the magazine? [Hint: For 
both three and four copies ordered, express 
net revenue as a function of demand X, and 
then compute the expected revenue.] 


35. Let X be the damage incurred (in $) in a 
certain type of accident during a given year. 
Possible X values are 0, 1000, 5000, and 
10,000, with probabilities .8, .1, .08, and 
.02, respectively. A particular company 
offers a $500 deductible policy. If the 
company wishes its expected profit to be 
$100, what premium amount should it 
charge? 


36. The n candidates for a job have been ranked 
1, 2, 3, ..., n. Let X = the rank of a randomly 
selected candidate, so that X has pmf 


p(x) = 1/n    x = 1, 2, 3, ..., n 


(this is called the discrete uniform distribution). 
Compute E(X) and V(X) using the 
shortcut formula. [Hint: The sum of the first 
n positive integers is n(n + 1)/2, whereas 
the sum of their squares is given by 
n(n + 1)(2n + 1)/6.] 


37. Let X = the outcome when a fair die is 
rolled once. If before the die is rolled you 
are offered either $100 or h(X) = 
350/X dollars, would you accept the guaranteed 
amount or would you gamble? 
[Hint: Determine E[h(X)], but be careful: 
the mean of 350/X is not 350/μ.] 


38. A supply company currently has in stock 
500 lb of fertilizer, which it sells to customers 
in 10-pound bags. Let X equal the 
number of bags purchased by a randomly 
selected customer. Sales data shows that 
X has the following pmf: 


x      1    2    3    4
p(x)  .2   .4   .3   .1


a. Compute the average number of bags 
bought per customer. 

b. Determine the standard deviation for 
the number of bags bought per 
customer. 

c. Define Y to be the amount of fertilizer 
left in stock, in pounds, after the first 
customer. Construct the pmf of Y. 

d. Use the pmf of Y to find the expected 
amount of fertilizer left in stock, in 
pounds, after the first customer. 

e. Write Y as a linear function of X. Then 
use rescaling properties to find the 
mean and standard deviation of Y. 

f. The supply company offers a discount 
to each customer based on the formula 
W = (X − 1)². Determine the expected 
discount for a customer. 

g. Does your answer to part (f) equal 
(μ_X − 1)²? Why or why not? 

h. Calculate the standard deviation of W. 


39. Refer back to the roulette scenario in 
Example 3.18. Two other ways to wager at 
roulette are betting on a single number, or 
on a four-number “square.” The pmfs for 
the returns on a $1 wager on a number and 
a square are displayed below. (Payoffs for 
winning are always based on the odds of 
losing a wager under the assumption the 
two green spaces didn’t exist.) 

Single number: 

x      −$1     +$35
p(x)  37/38    1/38

Square: 

x      −$1     +$8
p(x)  34/38    4/38


a. Determine the expected return from a 
$1 wager on a single number, and then 
on a square. 

b. Compare your answers from (a) to 
Example 3.18. What can be said about 
the expected return for a $1 wager? 
Based on this, does expected return 
reflect most players’ intuition that betting 
on black is “safer” and betting on a 
single number is “riskier”? 

c. Calculate the standard deviations for 
the two pmfs above as well as the pmf 
in Example 3.18. 

d. How do the standard deviations of the 
three betting schemes (color, single 
number, square) compare? How do 
these values appear to relate to players’ 
intuitive sense of risk? 


40. In the popular game Plinko on The Price Is 
Right, contestants drop a circular disk (a 
“chip”) down a pegged board; the chip 
bounces down the board and lands in a slot 
corresponding to one of five dollar amounts. 
The random variable X = winnings from 
one chip dropped from the middle slot has 
roughly the following distribution. 


x $0 $100 $500 $1000 $10,000 


p(x) 0.39 0.03 0.11 0.24 0.23 


a. Graph the probability mass function of 
X. 

b. What is the probability a contestant 
makes money on a chip? 

c. What is the probability a contestant 
makes at least $1000 on a chip? 

d. Determine the expected winnings. 
Interpret this number. 

e. Determine the corresponding standard 
deviation. 


41. a. Draw a line graph of the pmf of X in 
Exercise 34. Then determine the pmf 
of −X and draw its line graph. From 
these two pictures, what can you say 
about V(X) and V(−X)? 

b. Use the proposition involving 
V(aX + b) to establish a general relationship 
between V(X) and V(−X). 


42. Use the definition of variance to prove that 
V(aX + b) = a²σ²_X. [Hint: With Y = aX + b, 
E(Y) = aμ + b where μ = E(X).] 


43. Suppose E(X) = 5 and E[X(X − 1)] = 27.5. 
What is 

a. E(X²)? [Hint: E[X(X − 1)] = E[X² − X] = 
E(X²) − E(X).] 

b. V(X)? 

c. The general relationship among the 
quantities E(X), E[X(X − 1)], and 
V(X)? 


44. Write a general rule for E(X − c) where c is 
a constant. What happens when you let 
c = μ, the expected value of X? 


45. A result called Chebyshev’s inequality 
states that for any probability distribution of 
a rv X and any number k that is at least 1, 
P(|X − μ| ≥ kσ) ≤ 1/k². In words, the 
probability that the value of X lies at least 
k standard deviations from its mean is at 
most 1/k². 

a. What is the value of the upper bound 
for k = 2? k = 3? k = 4? k = 5? 
k = 10? 

b. Compute μ and σ for the distribution 
given in Exercise 13. Then evaluate 
P(|X − μ| ≥ kσ) for the values of 
k given in part (a). What does this 
suggest about the upper bound relative 
to the corresponding probability? 

c. Let X have possible values −1, 0, and 1, 
with probabilities 1/18, 8/9, and 1/18, 
respectively. What is P(|X − μ| ≥ 3σ), 
and how does its value compare to the 
corresponding bound? 

d. Give a distribution for which 
P(|X − μ| ≥ 5σ) = .04. 


3.4 Moments and Moment Generating Functions 


The expected values of integer powers of X and X − μ are often referred to as moments, terminology 
borrowed from physics. In this section, we’ll discuss the general topic of moments and develop a 
shortcut for computing them. 


DEFINITION 

The kth moment of a random variable X is E(X^k), while the kth moment 
about the mean (or kth central moment) of X is E[(X − μ)^k], 
where μ = E(X). 


For example, μ = E(X) is the “first moment” of X and corresponds to the center of mass of the 
distribution of X. Similarly, V(X) = E[(X − μ)²] is the second moment of X about the mean, which is 
known in physics as the moment of inertia. 


Example 3.26 A popular brand of dog food is sold in 5, 10, 15, and 20 lb bags. Let X be the weight 
of the next bag purchased, and suppose the pmf of X is 


x      5    10   15   20
p(x)  .1   .2   .3   .4


The first moment of X is its mean: 


\[ \mu = E(X) = \sum_{x \in D} x \, p(x) = 5(.1) + 10(.2) + 15(.3) + 20(.4) = 15 \text{ lbs} \]


The second moment about the mean is the variance: 


\[
\begin{aligned}
\sigma^2 = E[(X - \mu)^2] &= \sum_{x \in D} (x - \mu)^2 p(x) \\
&= (5 - 15)^2(.1) + (10 - 15)^2(.2) + (15 - 15)^2(.3) + (20 - 15)^2(.4) = 25,
\end{aligned}
\]

for a standard deviation of 5 lbs. The third central moment of X is 


\[
\begin{aligned}
E[(X - \mu)^3] &= \sum_{x \in D} (x - \mu)^3 p(x) \\
&= (5 - 15)^3(.1) + (10 - 15)^3(.2) + (15 - 15)^3(.3) + (20 - 15)^3(.4) = -75
\end{aligned}
\]

We’ll discuss an interpretation of this last number next. ■ 


It is not difficult to verify that the third moment about the mean is 0 if the pmf of X is symmetric. So,
we would like to use $E[(X - \mu)^3]$ as a measure of lack of symmetry, but it depends on the scale of
measurement. If we switch the unit of weight in Example 3.26 from pounds to ounces or kilograms, the
value of the third moment about the mean (as well as the values of all the other moments) will change.
But we can achieve scale independence by dividing the third moment about the mean by $\sigma^3$:


$$\frac{E[(X - \mu)^3]}{\sigma^3} = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] \qquad (3.14)$$


Expression (3.14) is our measure of departure from symmetry, called the skewness coefficient. The
skewness coefficient for a symmetric distribution is 0 because its third moment about the mean is 0.
However, in the foregoing example the skewness coefficient is $E[(X - \mu)^3]/\sigma^3 = -75/5^3 = -0.6$.
When the skewness coefficient is negative, as it is here, we say that the distribution is negatively 
skewed or that it is skewed to the left. Generally speaking, it means that the distribution stretches 
farther to the left of the mean than to the right. 

If the skewness coefficient were positive, then we would say that the distribution is positively 
skewed or that it is skewed to the right. For example, reverse the order of the probabilities in the pmf 
of Example 3.26, so the probabilities of the values 5, 10, 15, 20 are now .4, .3, .2, and .1 (customers 
now favor much smaller bags of dog food). Exercise 61 shows that this changes the sign but not the 
magnitude of the skewness coefficient, so it becomes +0.6 and the distribution is skewed right. Both 
distributions are illustrated in Figure 3.8.

[Figure: two line graphs of P(x) against x = 5, 10, 15, 20, panels (a) and (b)]


Figure 3.8 Departures from symmetry: (a) skewness coefficient < 0 (skewed left); 
(b) skewness coefficient > 0 (skewed right) 
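
The skewness calculations for the two dog food distributions can be reproduced with a few lines of code. The following Python sketch is only an illustration: the function names `central_moment` and `skewness` are our own, and each pmf is stored as a dictionary mapping values to probabilities.

```python
def central_moment(pmf, k):
    """k-th moment about the mean for a pmf given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())


def skewness(pmf):
    """Skewness coefficient E[(X - mu)^3] / sigma^3, as in Expression (3.14)."""
    sigma = central_moment(pmf, 2) ** 0.5
    return central_moment(pmf, 3) / sigma ** 3


bag_weights = {5: 0.1, 10: 0.2, 15: 0.3, 20: 0.4}    # pmf of Example 3.26
reversed_pmf = {5: 0.4, 10: 0.3, 15: 0.2, 20: 0.1}   # probabilities in reverse order

print(round(skewness(bag_weights), 3))   # negative: skewed left
print(round(skewness(reversed_pmf), 3))  # positive: skewed right
```

Reversing the probabilities leaves the variance at 25 but flips the sign of the third central moment, which is why only the sign of the coefficient changes.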


The Moment Generating Function 

Calculation of the mean, variance, skewness coefficient, etc., for a particular discrete rv requires
extensive, sometimes tedious, summation. Mathematicians have developed a tool, the moment generating
function, that will allow us to determine the moments of a distribution with less effort.
Moreover, this function will allow us to derive properties of several important probability
distributions in subsequent sections of the book.

Note first that $e^{1.7X}$ is a particular function of X; its expected value is $E(e^{1.7X}) = \sum e^{1.7x} \cdot p(x)$. The
number 1.7 in the foregoing expression can be replaced by any other number (2.5, 179, −3.25, etc.).
Now consider replacing 1.7 by the letter t. Then the expected value depends on the numerical value of
t; that is, $E(e^{tX})$ is a function of t. It is this function that will generate moments for us.


DEFINITION The moment generating function (mgf) of a discrete random variable
X is defined to be


$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x)$$


where D is the set of possible X values. The moment generating function
exists iff $M_X(t)$ is defined for an interval that includes zero as well as
positive and negative values of t.


For any random variable X, the mgf evaluated at t = 0 is


$$M_X(0) = E(e^{0 \cdot X}) = \sum_{x \in D} e^{0x} p(x) = \sum_{x \in D} 1 \cdot p(x) = 1$$


That is, $M_X(0)$ is the sum of all the probabilities, so it must always be 1. However, in order for the mgf
to be useful in generating moments, it will need to be defined for an interval of values of t including 0
in its interior. The moment generating function fails to exist in cases when moments themselves fail to
exist (see Example 3.30 below).


Example 3.27 The simplest example of an mgf is for a Bernoulli distribution, where only the
X values 0 and 1 receive positive probability. Let X be a Bernoulli random variable with p(0) = 1/3
and p(1) = 2/3. Then

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x) = e^{t \cdot 0} \cdot \frac{1}{3} + e^{t \cdot 1} \cdot \frac{2}{3} = \frac{1}{3} + \frac{2}{3} e^t$$


A Bernoulli random variable will always have an mgf of the form $p(0) + p(1)e^t$, a well-defined
function for all values of t. ■


A key property of the mgf is its “uniqueness,” the fact that it completely characterizes the 
underlying distribution. 


MGF UNIQUENESS 
THEOREM 


If the mgf exists and is the same for two distributions, then the two 
distributions are identical. That is, the moment generating function 

uniquely specifies the probability distribution; there is a one-to-one 
correspondence between distributions and mgfs. 


The proof of this theorem, originally due to Laplace, requires some sophisticated mathematics and is 
beyond the scope of this textbook. 


Example 3.28 Let X, the number of claims submitted on a renter’s insurance policy in a given year,
have mgf $M_X(t) = .7 + .2e^t + .1e^{2t}$. It follows that X must have the pmf p(0) = .7, p(1) = .2, and
p(2) = .1, because if we use this pmf to obtain the mgf, we get $M_X(t)$, and the distribution is uniquely
determined by its mgf. ■


Example 3.29 Consider testing individuals’ blood samples one by one in order to find someone
whose blood type is Rh+. The rv X = the number of tested samples should follow the pmf specified in
Example 3.10 with p = .85:


$$p(x) = .85(.15)^{x-1} \quad \text{for } x = 1, 2, 3, \ldots$$


Determining the moment generating function here requires using the formula for the sum of a
geometric series: $1 + r + r^2 + \cdots = 1/(1 - r)$ for $|r| < 1$. The moment generating function is

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x) = \sum_{x=1}^{\infty} e^{tx}\,.85(.15)^{x-1} = .85e^t \sum_{x=1}^{\infty} e^{t(x-1)}(.15)^{x-1}$$

$$= .85e^t \sum_{x=1}^{\infty} (.15e^t)^{x-1} = .85e^t\left[1 + .15e^t + (.15e^t)^2 + \cdots\right] = \frac{.85e^t}{1 - .15e^t}$$


The condition on r requires $|.15e^t| < 1$. Dividing by .15 and taking logs gives $t < -\ln(.15) \approx 1.90$;
i.e., this function is defined in the interval $(-\infty, 1.90)$. The result is an interval of values that includes
0 in its interior, so the mgf exists. As a check, $M_X(0) = .85/(1 - .15) = 1$, as required. ■
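
The closed form can be checked numerically against the defining series. The Python sketch below is an illustration only: the function names and the choice of 200 terms for the partial sum are our own assumptions, and the comparison is meaningful only for t below −ln(.15) ≈ 1.90, where the series converges.

```python
import math


def mgf_partial_sum(t, p=0.85, terms=200):
    """Partial sum of sum_{x>=1} e^{tx} * p * (1-p)^(x-1), straight from the definition."""
    return sum(math.exp(t * x) * p * (1 - p) ** (x - 1) for x in range(1, terms + 1))


def mgf_closed_form(t, p=0.85):
    """Closed form p*e^t / (1 - (1-p)*e^t); only valid for t < -ln(1-p)."""
    if t >= -math.log(1 - p):
        raise ValueError("the geometric series diverges for this t")
    return p * math.exp(t) / (1 - (1 - p) * math.exp(t))


for t in (-1.0, 0.0, 0.5, 1.5):
    print(t, mgf_partial_sum(t), mgf_closed_form(t))
```

At t = 0 both columns equal 1, matching the check at the end of the example.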


Example 3.30 Reconsider Example 3.20, where $p(x) = k/x^2$, x = 1, 2, 3, .... Recall that E(X) does
not exist for this distribution, portending a problem for the existence of the mgf:

$$M_X(t) = E(e^{tX}) = \sum_{x=1}^{\infty} e^{tx} \frac{k}{x^2}$$

With the help of tests for convergence such as the ratio test, we find that the series converges if and
only if $e^t \le 1$, which means that $t \le 0$; i.e., the mgf is only defined on the interval $(-\infty, 0]$. Because
zero is on the boundary of this interval, not the interior of the interval (the interval must include both
positive and negative values), the mgf of this distribution does not exist. In any case, it could not be
useful for finding moments, because X does not have even a first moment (mean). ■


Obtaining Moments from the MGF 


For any positive integer r, let $M_X^{(r)}(t)$ denote the rth derivative of $M_X(t)$. By computing this and then
setting t = 0, we get the rth moment about 0.


THEOREM If the mgf of X exists, then $E(X^r)$ is finite for all positive integers r, and


$$E(X^r) = M_X^{(r)}(0) \qquad (3.15)$$


Proof The proof of the finiteness of all moments is beyond the scope of this book. We will show that
Expression (3.15) is true for r = 1 and r = 2. A proof by mathematical induction can be used for
general r. The first derivative of the mgf is


$$\frac{d}{dt} M_X(t) = \frac{d}{dt} \sum_{x \in D} e^{xt} p(x) = \sum_{x \in D} \frac{d}{dt} e^{xt} p(x) = \sum_{x \in D} x e^{xt} p(x)$$


where we have interchanged the order of summation and differentiation. (This is justified inside the
interval of convergence, which includes 0 in its interior.) Next set t = 0 to obtain the first moment:


$$M_X'(0) = M_X^{(1)}(0) = \sum_{x \in D} x e^{x \cdot 0} p(x) = \sum_{x \in D} x\,p(x) = E(X)$$


Differentiating a second time gives


$$\frac{d^2}{dt^2} M_X(t) = \frac{d}{dt} \sum_{x \in D} x e^{xt} p(x) = \sum_{x \in D} x \frac{d}{dt} e^{xt} p(x) = \sum_{x \in D} x^2 e^{xt} p(x)$$


Set t = 0 to get the second moment:


$$M_X''(0) = M_X^{(2)}(0) = \sum_{x \in D} x^2 p(x) = E(X^2) \qquad ■$$


For the pmfs in Examples 3.27 and 3.28, this may seem like needless work—after all, for simple 
distributions with just a few values, we can quickly determine the mean, variance, etc. The real utility 
of the mgf arises for more complicated distributions. 


Example 3.31 (Example 3.29 continued) Recall that p = .85 is the probability of a person having
Rh+ blood, and we keep checking people until we find one with this blood type. If X is the number of
people we need to check, then $p(x) = .85(.15)^{x-1}$, x = 1, 2, 3, ..., and the mgf is

$$M_X(t) = E(e^{tX}) = \frac{.85e^t}{1 - .15e^t}$$

Differentiating with the help of the quotient rule,

$$M_X'(t) = \frac{.85e^t}{(1 - .15e^t)^2}$$

Setting t = 0 then gives $\mu = E(X) = M_X'(0) = 1/.85 = 1.176$. This corresponds to the formula
$\mu = 1/p$ when .85 is replaced by p.
To get the second moment, differentiate again:

$$M_X''(t) = \frac{.85e^t(1 + .15e^t)}{(1 - .15e^t)^3}$$

Setting t = 0, $E(X^2) = M_X''(0) = 1.15/.85^2$. Now use the variance shortcut formula:

$$V(X) = \sigma^2 = E(X^2) - \mu^2 = \frac{1.15}{.85^2} - \left(\frac{1}{.85}\right)^2 = \frac{.15}{(.85)^2} = .2076 \qquad ■$$
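
The derivatives in this example can be sanity-checked numerically. The Python sketch below approximates the first two derivatives of the mgf at t = 0 with central differences. This is a swapped-in numerical technique used purely as a check, not the method of the text, and the step size h = 1e-4 and the function names are our own assumptions.

```python
import math


def mgf(t, p=0.85):
    """mgf from Example 3.29: p*e^t / (1 - (1-p)*e^t)."""
    return p * math.exp(t) / (1 - (1 - p) * math.exp(t))


def first_derivative(f, t=0.0, h=1e-4):
    """Central-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


def second_derivative(f, t=0.0, h=1e-4):
    """Central-difference approximation to f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2


mean = first_derivative(mgf)             # approximates E(X) = 1/.85
second_moment = second_derivative(mgf)   # approximates E(X^2) = 1.15/.85^2
variance = second_moment - mean ** 2     # variance shortcut formula

print(round(mean, 3), round(variance, 4))
```

The printed values should agree with 1.176 and .2076 up to the small error inherent in finite differences.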


There is an alternate way of doing the differentiation that can sometimes make the effort easier. Define
$R_X(t) = \ln[M_X(t)]$, where ln(u) is the natural log of u. In Exercise 54 you are requested to verify that if
the moment generating function exists,

$$\mu = E(X) = R_X'(0) \qquad \sigma^2 = V(X) = R_X''(0)$$


Example 3.32 Here we apply $R_X(t)$ to Example 3.31. Using properties of logarithms,

$$R_X(t) = \ln[M_X(t)] = \ln\left(\frac{.85e^t}{1 - .15e^t}\right) = \ln(.85) + t - \ln(1 - .15e^t)$$

The first derivative is

$$R_X'(t) = 1 - \frac{1}{1 - .15e^t}(-.15e^t) = 1 + \frac{.15e^t}{1 - .15e^t} = \frac{1}{1 - .15e^t}$$

and the second derivative is

$$R_X''(t) = \frac{.15e^t}{(1 - .15e^t)^2}$$

Setting t to 0 gives

$$\mu = E(X) = R_X'(0) = \frac{1}{.85} \qquad \sigma^2 = V(X) = R_X''(0) = \frac{.15}{(.85)^2}$$

These are in agreement with the results of Example 3.31. ■


As mentioned in Section 3.3, it is common to transform a rv X using a linear function 
Y = aX + b. What happens to the mgf when we do this? 


PROPOSITION Let X have the mgf $M_X(t)$ and let Y = aX + b. Then $M_Y(t) = e^{bt} M_X(at)$.


Example 3.33 Let X be a Bernoulli random variable with p(0) = 20/38 and p(1) = 18/38. Think of
X as the number of wins, 0 or 1, in a single play of roulette. If you play roulette at an American casino
and bet on red, then your chance of winning is 18/38 because 18 of the 38 possible outcomes are red.
From Example 3.27, $M_X(t) = 20/38 + e^t(18/38)$. Suppose you bet $5 on red, and let Y be your
winnings. If X = 0 then Y = −5, and if X = 1 then Y = +5. The linear equation Y = 10X − 5 gives the
appropriate relationship.

This equation is of the form Y = aX + b with a = 10 and b = −5, so by the foregoing proposition


$$M_Y(t) = e^{bt} M_X(at) = e^{-5t} M_X(10t) = e^{-5t}\left[\frac{20}{38} + e^{10t} \cdot \frac{18}{38}\right] = e^{-5t} \cdot \frac{20}{38} + e^{5t} \cdot \frac{18}{38}$$


This implies that the pmf of Y is p(−5) = 20/38 and p(5) = 18/38; moreover, we can compute the
mean (and other moments) of Y directly from this mgf. ■


Exercises: Section 3.4 (46-61) 


46. Let X be the number of pumps in use at a gas station, and suppose X has the distribution
given by the accompanying table. Determine $M_X(t)$ and use it to find E(X) and V(X).

x       0     1     2     3     4     5     6
p(x)    .04   .20   .34   .20   .15   .04   .03

47. In flipping a fair coin let X be the number of tosses to get the first head. Then $p(x) = .5^x$
for x = 1, 2, 3, .... Determine $M_X(t)$ and use it to get E(X) and V(X).

48. If you toss a fair die with outcome X, p(x) = 1/6 for x = 1, 2, 3, 4, 5, 6. Find $M_X(t)$.

49. For the entry-level employees of a certain fast food chain, the pmf of X = highest grade
level completed is specified by p(9) = .01, p(10) = .05, p(11) = .16, and p(12) = .78.
a. Determine the moment generating function of this distribution.
b. Use (a) to determine the mean and variance of this distribution.

50. Calculate the skewness coefficient for each of the distributions in the previous four
exercises. Do those agree with the “shape” of each distribution?

51. Given $M_X(t) = .2 + .3e^t + .5e^{3t}$, find p(x), E(X), V(X).

52. If $M_X(t) = 1/(1 - t^2)$, find E(X) and V(X) by differentiating $M_X(t)$.

53. Show that $g(t) = te^t$ cannot be a moment generating function.

54. Let $M_X(t)$ be the moment generating function of a rv X, and define $R_X(t) = \ln[M_X(t)]$.
Show that
a. $R_X(0) = 0$
b. $R_X'(0) = \mu_X$
c. $R_X''(0) = \sigma_X^2$

55. If $M_X(t) = e^{5t + 2t^2}$ then find E(X) and V(X) by differentiating
a. $M_X(t)$
b. $R_X(t)$

56. If $M_X(t) = e^{5(e^t - 1)}$ then find E(X) and V(X) by differentiating
a. $M_X(t)$
b. $R_X(t)$

57. Using a calculation similar to the one in Example 3.29 show that, if X has the
distribution of Example 3.10, then its mgf is

$$M_X(t) = \frac{pe^t}{1 - (1 - p)e^t}$$

If Y has mgf $M_Y(t) = .75e^t/(1 - .25e^t)$, determine the probability mass function
$p_Y(y)$ with the help of the uniqueness property.

58. Let X have the moment generating function of Example 3.29 and let Y = X − 1. Recall
that X is the number of people who need to be checked to get someone who is Rh+, so Y is
the number of people checked before the first Rh+ person is found. Find $M_Y(t)$ using the
last proposition in this section.

59. Let $M_X(t) = e^{5t + 2t^2}$ and let Y = (X − 5)/2. Find $M_Y(t)$ and use it to find E(Y) and V(Y).

60. a. Prove the result in the last proposition of this section: $M_{aX+b}(t) = e^{bt} M_X(at)$.
b. Let Y = aX + b. Use (a) to establish the relationships between the means and
variances of X and Y.

61. Let X be the number of points earned by a randomly selected student on a 10 point
quiz, with possible values 0, 1, 2, ..., 10 and pmf p(x), and suppose the distribution
has a skewness of c. Now consider reversing the probabilities in the distribution, so
that p(0) is interchanged with p(10), p(1) is interchanged with p(9), and so on. Show
that the skewness of the resulting distribution is −c. [Hint: Let Y = 10 − X and show
that Y has the reversed distribution. Use this fact to determine $\mu_Y$ and then the value of
skewness for the Y distribution.]


3.5 The Binomial Probability Distribution 


Many experiments conform either exactly or approximately to the following list of requirements: 

1. The experiment consists of a sequence of n smaller experiments called trials, where n is fixed in 
advance of the experiment. 

2. Each trial can result in one of the same two possible outcomes (dichotomous trials), which we 
denote by success (S) or failure (F). 

3. The trials are independent, so that the outcome on any particular trial does not influence the 
outcome on any other trial. 

4. The probability of success is constant from trial to trial (homogeneous trials); we denote this 
probability by p. 


DEFINITION An experiment for which Conditions 1-4 are satisfied—a fixed number 
of dichotomous, independent, homogeneous trials—is called a binomial 
experiment. 


Example 3.34 The same coin is tossed successively and independently n times. We arbitrarily use 
S to denote the outcome H (heads) and F to denote the outcome T (tails). Then this experiment 
satisfies Conditions 1-4. Wagering on n spins of a roulette wheel, with S = win money and F = lose 
money, also results in a binomial experiment so long as you bet the same way every time (e.g., always 
on black, so that P(S) remains constant across different spins). Another binomial experiment was
alluded to in Example 3.10: there, a binomial experiment would consist of sending a fixed number
n of messages across a communication channel, with S = message received correctly and
F = received message contains errors. ■


Some experiments involve a sequence of independent trials for which there are more than two 
possible outcomes on any one trial. A binomial experiment can then be created by dividing the 
possible outcomes into two groups. 


Example 3.35 The color of pea seeds is determined by a single genetic locus. If the two alleles at 
this locus are AA or Aa (the genotype), then the pea will be yellow (the phenotype), and if the allele is 
aa, the pea will be green. Suppose we pair off 20 Aa seeds and cross the two seeds in each of the ten 
pairs to obtain ten new genotypes. Call each new genotype a success S if it is aa and a failure 
otherwise. Then with this identification of S and F, the experiment is binomial with n = 10 and 
p = P(aa genotype). If each member of the pair is equally likely to contribute either a or A, then
p = P(a) · P(a) = (1/2)(1/2) = .25. ■


Example 3.36 A student has an iPod playlist containing 50 songs, of which 35 were recorded prior 
to the year 2018 and the other 15 were recorded more recently. Suppose the shuffle function is used to 
select five from among these 50 songs for listening during a walk between classes. Each selection of a 
song constitutes a trial; regard a trial as a success if the selected song was recorded before 2018. Then 


$$P(S \text{ on first trial}) = \frac{35}{50} = .70$$

and

$$P(S \text{ on second trial}) = P(SS) + P(FS)$$
$$= P(\text{second } S \mid \text{first } S)P(\text{first } S) + P(\text{second } S \mid \text{first } F)P(\text{first } F)$$
$$= \frac{34}{49} \cdot \frac{35}{50} + \frac{35}{49} \cdot \frac{15}{50} = \frac{35}{50}\left(\frac{34}{49} + \frac{15}{49}\right) = \frac{35}{50} = .70$$

Similarly, it can be shown that P(S on ith trial) = .70 for i = 3, 4, 5, so the trials are homogeneous.
However,

$$P(S \text{ on fifth trial} \mid SSSS) = \frac{31}{46} = .67$$

whereas

$$P(S \text{ on fifth trial} \mid FFFF) = \frac{35}{46} = .76$$

The experiment is not binomial because the trials are not independent. In general, if sampling is
without replacement, the experiment will not yield independent trials. If songs had been selected with
replacement, then trials would have been independent, but this might have resulted in the same song
being listened to more than once. ■


Example 3.37 Suppose a state has 500,000 licensed drivers, of whom 400,000 are insured. 
A sample of ten drivers is chosen without replacement. The ith trial is labeled S if the ith driver 
chosen is insured. Although this situation would seem identical to that of Example 3.36, the important
difference is that the size of the population being sampled is very large relative to the sample size. In
this case

$$P(S \text{ on } 2 \mid S \text{ on } 1) = \frac{399{,}999}{499{,}999} = .7999996 \approx .8$$

and

$$P(S \text{ on } 10 \mid S \text{ on first } 9) = \frac{399{,}991}{499{,}991} = .799996 \approx .8$$

These calculations suggest that although the trials are not exactly independent, the conditional
probabilities differ so slightly from one another that for practical purposes the trials can be regarded as
independent with constant P(S) = .8. Thus, to a very good approximation, the experiment is binomial
with n = 10 and p = .8. ■


We will use the following convention in deciding whether a “without-replacement” experiment 
can be treated as a binomial experiment. 


RULE Consider sampling without replacement from a dichotomous population of size N. If 
the sample size (number of trials) n is at most 5% of the population size, the 
experiment can be analyzed as though it were exactly a binomial experiment. 


By “analyzed,” we mean that probabilities based on the binomial experiment assumptions will be 
quite close to the actual “without-replacement” probabilities, which are typically more difficult to 
calculate. In Example 3.36, n/N = 5/50 = .1 > .05, so the binomial experiment is not a good 
approximation, but in Example 3.37, n/N = 10/500,000 < .05. 
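
To see how much the 5% rule matters, the exact without-replacement probabilities (obtained by counting combinations) can be compared with their binomial approximations. The Python sketch below is illustrative only; the function names are our own, and the largest absolute pmf difference is just one hypothetical way of summarizing the discrepancy.

```python
from math import comb


def without_replacement_pmf(x, n, successes, population):
    """Exact P(X = x) when n items are drawn without replacement."""
    return (comb(successes, x) * comb(population - successes, n - x)
            / comb(population, n))


def binomial_pmf(x, n, p):
    """b(x; n, p) under the binomial experiment assumptions."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def max_abs_gap(n, successes, population):
    """Largest absolute difference between the two pmfs over x = 0, ..., n."""
    p = successes / population
    return max(
        abs(without_replacement_pmf(x, n, successes, population) - binomial_pmf(x, n, p))
        for x in range(n + 1)
    )


playlist_gap = max_abs_gap(n=5, successes=35, population=50)           # Example 3.36
drivers_gap = max_abs_gap(n=10, successes=400_000, population=500_000)  # Example 3.37

print(playlist_gap, drivers_gap)
```

The gap for the playlist example (n/N = .1) is several orders of magnitude larger than the gap for the driver example, consistent with the rule.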


The Binomial Random Variable and Distribution 
In most binomial experiments, it is the total number of successes, rather than knowledge of exactly 
which trials yielded successes, that is of interest. 


DEFINITION Given a binomial experiment consisting of n trials, the binomial random 
variable X associated with this experiment is defined as 


X = the number of successes among the n trials 


Suppose, for example, that n = 3. Then there are eight possible outcomes for the experiment: 
S= {SSS, SSF, SFS, SFF, FSS, FSF, FFS, FFF} 


From the definition of X, X(SSF) = 2, X(SFF) = 1, and so on. Possible values for X in an n-trial
experiment are x = 0, 1, 2, ..., n.


NOTATION We will write X ~ Bin(n, p) to indicate that X is a binomial rv based on n trials
with success probability p. Because the pmf of a binomial rv X depends on the
two parameters n and p, we denote the pmf by b(x; n, p).


Our next goal is to derive a formula for the binomial pmf. Consider first the case n = 4 for which each
outcome, its probability, and corresponding x value are listed in Table 3.1.

Table 3.1 Outcomes and probabilities for a binomial experiment with four trials

Outcome   x   Probability        Outcome   x   Probability
SSSS      4   p^4                FSSS      3   p^3(1 − p)
SSSF      3   p^3(1 − p)         FSSF      2   p^2(1 − p)^2
SSFS      3   p^3(1 − p)         FSFS      2   p^2(1 − p)^2
SSFF      2   p^2(1 − p)^2       FSFF      1   p(1 − p)^3
SFSS      3   p^3(1 − p)         FFSS      2   p^2(1 − p)^2
SFSF      2   p^2(1 − p)^2       FFSF      1   p(1 − p)^3
SFFS      2   p^2(1 − p)^2       FFFS      1   p(1 − p)^3
SFFF      1   p(1 − p)^3         FFFF      0   (1 − p)^4

For example,

$$P(SSFS) = P(S) \cdot P(S) \cdot P(F) \cdot P(S) \qquad \text{(independent trials)}$$
$$= p \cdot p \cdot (1 - p) \cdot p \qquad \text{(constant } P(S)\text{)}$$
$$= p^3(1 - p)$$

In this special case, we wish b(x; 4, p) for x = 0, 1, 2, 3, and 4. For b(3; 4, p), we identify which of the
16 outcomes yield an x value of 3 and sum the probabilities associated with each such outcome:


$$b(3; 4, p) = P(FSSS) + P(SFSS) + P(SSFS) + P(SSSF) = 4p^3(1 - p)$$


There are four outcomes with x = 3 and each has probability $p^3(1 - p)$ (the probability depends only
on the number of S’s, not the order of S’s and F’s), so


$$b(3; 4, p) = \left\{\text{number of outcomes with } X = 3\right\} \cdot \left\{\text{probability of any particular outcome with } X = 3\right\}$$


Similarly, $b(2; 4, p) = 6p^2(1 - p)^2$, which is also the product of the number of outcomes with X = 2
and the probability of any such outcome.
In general,

$$b(x; n, p) = \left\{\text{number of sequences of length } n \text{ consisting of } x \text{ S's}\right\} \cdot \left\{\text{probability of any particular such sequence}\right\}$$

Since the ordering of S’s and F’s is not important, the second factor in braces is $p^x(1 - p)^{n-x}$ (e.g., the
first x trials resulting in S and the last n − x resulting in F). The first factor is the number of ways of
choosing x of the n trials to be S’s, that is, the number of combinations of size x that can be
constructed from n distinct objects (trials here).


THEOREM $$b(x; n, p) = \binom{n}{x} p^x (1 - p)^{n-x} \qquad x = 0, 1, 2, \ldots, n$$
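
The pmf formula translates directly into code. The Python sketch below is a minimal illustration: the function name `binomial_pmf` and the value p = .3 are our own choices, and values of x outside 0, ..., n are assigned probability 0.

```python
from math import comb


def binomial_pmf(x, n, p):
    """b(x; n, p): probability of exactly x successes in n independent trials."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 4, 0.3  # p = .3 is an arbitrary illustrative value
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(pmf)
print(sum(pmf))                                      # the probabilities sum to 1
print(binomial_pmf(3, n, p), 4 * p ** 3 * (1 - p))   # matches 4p^3(1 - p)
```

Here `comb(n, x)` counts the sequences with exactly x S’s, and the remaining factor is the probability of any one such sequence.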


Example 3.38 Each of six randomly selected cola drinkers is given a glass containing cola S and
one containing cola F. The glasses are identical in appearance except for a code on the bottom to
identify the cola. Suppose there is no tendency among cola drinkers to prefer one cola to the other.
Then p = P(a selected individual prefers S) = .5, so with X = the number among the six who prefer
S, X ~ Bin(6, .5).
Thus


$$P(X = 3) = b(3; 6, .5) = \binom{6}{3}(.5)^3(.5)^3 = 20(.5)^6 = .313$$


The probability that at least three prefer S is

$$P(3 \le X) = \sum_{x=3}^{6} b(x; 6, .5) = \sum_{x=3}^{6} \binom{6}{x}(.5)^x(.5)^{6-x} = .656$$

Computing Binomial Probabilities 

Even for a relatively small value of n, the computation of binomial probabilities can be tedious.
Software and statistical tables are both available for this purpose; both are often in terms of the cdf
$F(x) = P(X \le x)$ of the distribution, either in lieu of or in addition to the pmf. Various other
probabilities can then be calculated using the proposition on cdfs from Section 3.2.


NOTATION For X ~ Bin(n, p), the cdf will be denoted by


$$B(x; n, p) = P(X \le x) = \sum_{y=0}^{x} b(y; n, p) \qquad x = 0, 1, \ldots, n$$


Many software packages, including R, have built-in functions to evaluate both the pmf and cdf of the
binomial distribution (and many other named distributions). Table 3.2 provides the code for
performing binomial calculations in R. In addition, Appendix Table A.1 shows the binomial cdf for
n = 5, 10, 15, 20, 25 in combination with selected values of p.


Table 3.2 Binomial probability calculations in R 


Function: pmf cdf 
Notation: b(x; n, p) B(x; n, p) 
R: dbinom(x, n, p) pbinom(x, n, p) 


Example 3.39 Suppose that 20% of all copies of a particular textbook fail a binding strength test. 
Let X denote the number among 15 randomly selected copies that fail the test. Then X has a binomial 
distribution with n = 15 and p = .2: X ~ Bin(15, .2).




(a) The probability that at most 8 fail the test is

$$P(X \le 8) = \sum_{y=0}^{8} b(y; 15, .2) = B(8; 15, .2)$$

This is found at the intersection of the p = .2 column and x = 8 row in the n = 15 part of
Table A.1: B(8; 15, .2) = .999. In R, we may type pbinom(8, 15, .2).


(b) The probability that exactly 8 fail is $P(X = 8) = b(8; 15, .2) = \binom{15}{8}(.2)^8(.8)^7 = .0035$. We can
evaluate this probability in R with the call dbinom(8, 15, .2). To use Table A.1, write


$$P(X = 8) = P(X \le 8) - P(X \le 7) = B(8; 15, .2) - B(7; 15, .2)$$


which is the difference between two consecutive entries in the p = .2 column. The result is
.999 − .996 = .003.

(c) The probability that at least 8 fail is $P(X \ge 8) = 1 - P(X \le 7) = 1 - B(7; 15, .2)$. The cdf may
be evaluated using R as above, or by looking up the entry in the x = 7 row of the p = .2 column
in Table A.1. In any case, we find $P(X \ge 8) = 1 - .996 = .004$.

(d) Finally, the probability that between 4 and 7, inclusive, fail is


$$P(4 \le X \le 7) = P(X = 4, 5, 6, \text{ or } 7) = P(X \le 7) - P(X \le 3)$$
$$= B(7; 15, .2) - B(3; 15, .2) = .996 - .648 = .348$$


Notice that this latter probability is the difference between the cdf values at x = 7 and x = 3, not x = 7
and x = 4. ■
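
For readers working outside R, the four probabilities in this example can be recomputed from the pmf formula alone. The Python sketch below is an illustration; `binomial_pmf` and `binomial_cdf` are our own hypothetical stand-ins for R’s dbinom and pbinom, not library functions.

```python
from math import comb


def binomial_pmf(x, n, p):
    """b(x; n, p), playing the role of R's dbinom(x, n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_cdf(x, n, p):
    """B(x; n, p) = P(X <= x), playing the role of R's pbinom(x, n, p)."""
    return sum(binomial_pmf(y, n, p) for y in range(x + 1))


n, p = 15, 0.2
at_most_8 = binomial_cdf(8, n, p)                                 # part (a)
exactly_8 = binomial_pmf(8, n, p)                                 # part (b)
at_least_8 = 1 - binomial_cdf(7, n, p)                            # part (c)
between_4_and_7 = binomial_cdf(7, n, p) - binomial_cdf(3, n, p)   # part (d)

for value in (at_most_8, exactly_8, at_least_8, between_4_and_7):
    print(round(value, 4))
```

Note that part (d) subtracts the cdf at 3, not at 4, because the value x = 4 must remain in the interval.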


Example 3.40 An electronics manufacturer claims that at most 10% of its power supply units need 
service during the warranty period. To investigate this claim, technicians at a testing laboratory 
purchase 20 units and subject each one to accelerated testing to simulate use during the warranty 
period. Let p denote the probability that a power supply unit needs repair during the period (the 
proportion of all such units that need repair). The laboratory technicians must decide whether the data 
resulting from the experiment supports the claim that $p \le .10$. Let X denote the number among the
20 sampled that need repair, so X ~ Bin(20, p). Consider the following decision rule:


Reject the claim that $p \le .10$ in favor of the conclusion that $p > .10$ if $x \ge 5$
(where x is the observed value of X), and consider the claim plausible if $x \le 4$.


The probability that the claim is rejected when p = .10 (an incorrect conclusion) is

$$P(X \ge 5 \text{ when } p = .10) = 1 - B(4; 20, .1) = 1 - .957 = .043$$

The probability that the claim is not rejected when p = .20 (a different type of incorrect conclusion)
is

$$P(X \le 4 \text{ when } p = .2) = B(4; 20, .2) = .630$$

The first probability is rather small, but the second is intolerably large. When p = .20, so that the
manufacturer has grossly understated the percentage of units that need service, and the stated decision
rule is used, 63% of all samples will result in the manufacturer’s claim being judged plausible!
One might think that the probability of this second type of erroneous conclusion could be made
smaller by changing the cutoff value 5 in the decision rule to something else. However, although
replacing 5 by a smaller number would yield a probability smaller than .630, the other probability
would then increase. The only way to make both “error probabilities” small is to base the decision
rule on an experiment involving many more units (i.e., to increase n). ■
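
The trade-off between the two error probabilities can be tabulated for several cutoff values. In the Python sketch below, the function `error_probabilities` and its parameter names are our own illustrative choices; the rule is always “reject the claim when X is at least the cutoff.”

```python
from math import comb


def binomial_cdf(x, n, p):
    """B(x; n, p) = P(X <= x)."""
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(x + 1))


def error_probabilities(n, cutoff, p_claimed=0.10, p_alternative=0.20):
    """Return (P(reject claim | p = p_claimed), P(judge claim plausible | p = p_alternative))."""
    reject_when_true = 1 - binomial_cdf(cutoff - 1, n, p_claimed)
    plausible_when_false = binomial_cdf(cutoff - 1, n, p_alternative)
    return reject_when_true, plausible_when_false


for cutoff in (3, 4, 5, 6):
    first, second = error_probabilities(20, cutoff)
    print(cutoff, round(first, 3), round(second, 3))
```

With n fixed at 20, lowering the cutoff shrinks the second probability only at the cost of inflating the first.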


The Mean and Variance of a Binomial Random Variable 

For n = 1, the binomial distribution becomes the Bernoulli distribution. From Example 3.17, the 
mean value of a Bernoulli variable is $\mu = p$, so the expected number of S’s on any single trial is
p. Since a binomial experiment consists of n trials, intuition suggests that for X ~ Bin(n, p), 
E(X) = np, the product of the number of trials and the probability of success on a single trial. The 
expression for V(X) is not so intuitive. 


PROPOSITION If X ~ Bin(n, p), then $E(X) = np$, $V(X) = np(1 - p) = npq$, and
$SD(X) = \sqrt{npq}$ (where $q = 1 - p$).


Thus, calculating the mean and variance of a binomial rv does not necessitate evaluating summations 
of the sort we employed in Section 3.3. The proof of the result for E(X) is sketched in Exercise 86, 
and both the mean and the variance are obtained below using the moment generating function. 


Example 3.41 If 75% of all purchases at a store are made with a credit card and X is the number 
among ten randomly selected purchases made with a credit card, then X ~ Bin(10,.75). Thus 
E(X) = np = (10)(.75) = 7.5, V(X) = npq = 10(.75)(.25) = 1.875, and $\sigma = \sqrt{1.875} = 1.37$.
Again, even though X can take on only integer values, E(X) need not be an integer. If we perform a
large number of independent binomial experiments, each with n = 10 trials and p = .75, then the
average number of S’s per experiment will be close to 7.5. ■


An important application of the binomial distribution is to estimating the precision of simulated
probabilities, as in Section 2.6. The relative frequency definition of probability justified defining an
estimate of a probability P(A) by $\hat{P}(A) = X/n$, where n is the number of runs of the simulation
program and X equals the number of runs in which event A occurred. Assuming the runs of our
simulation are independent (and they usually are), the rv X has a binomial distribution with
parameters n and p = P(A). From the preceding proposition and the rescaling properties of mean and
standard deviation, we have


$$E(\hat{P}(A)) = E\left(\frac{1}{n}X\right) = \frac{1}{n}E(X) = \frac{1}{n}(np) = p = P(A)$$


Thus we expect the value of our estimate to coincide with the probability being estimated, in the sense
that there is no reason for $\hat{P}(A)$ to be systematically higher or lower than P(A). Also,


$$SD(\hat{P}(A)) = SD\left(\frac{1}{n}X\right) = \frac{1}{n}SD(X) = \frac{1}{n}\sqrt{np(1 - p)} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{P(A)[1 - P(A)]}{n}} \qquad (3.16)$$


Expression (3.16) is called the standard error of $\hat{P}(A)$ (essentially a synonym for standard deviation)
and indicates the amount by which an estimate $\hat{P}(A)$ “typically” varies from the true probability P(A).
However, this expression isn’t of much use in practice: we most often simulate a probability when
P(A) is unknown, which prevents us from using (3.16). As a solution, we simply substitute the
estimate $\hat{P} = \hat{P}(A)$ into this expression and get


$$\sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}$$


This is the estimated standard error formula (2.9) given in Section 2.6. Very importantly, this
estimated standard error gets closer to 0 as the number of runs, n, in the simulation increases.
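
The shrinking standard error is easy to observe in a small simulation. The Python sketch below is only an illustration under our own assumptions: the event “the sum of two fair dice is 7” (true probability 1/6), the function names, and the fixed seed are all hypothetical choices made so the run is reproducible.

```python
import math
import random


def simulate_estimate(event, n, seed=0):
    """Run n independent simulations; return (estimate of P(A), estimated standard error)."""
    rng = random.Random(seed)
    x = sum(1 for _ in range(n) if event(rng))      # X = number of runs in which A occurred
    p_hat = x / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
    return p_hat, se_hat


def two_dice_sum_is_seven(rng):
    """Hypothetical event A used for illustration."""
    return rng.randint(1, 6) + rng.randint(1, 6) == 7


for n in (100, 10_000):
    p_hat, se_hat = simulate_estimate(two_dice_sum_is_seven, n)
    print(n, round(p_hat, 4), round(se_hat, 4))
```

Multiplying the number of runs by 100 cuts the estimated standard error by roughly a factor of 10, reflecting the square root of n in the denominator.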


The Moment Generating Function of a Binomial Random Variable 
Determining the mgf of a binomial rv relies on the binomial theorem, which states that
$(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} a^x b^{n-x}$. Using the definition,


$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n-x}$$
$$= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 - p)^{n-x} = (pe^t + 1 - p)^n$$


Notice that the mgf satisfies the property $M_X(0) = 1$ required of all moment generating functions. The
mean and variance can be obtained by differentiating $M_X(t)$:


$$M_X'(t) = n(pe^t + 1 - p)^{n-1} pe^t \quad \text{and} \quad \mu = M_X'(0) = np$$


Then the second derivative is


$$M_X''(t) = n(n - 1)(pe^t + 1 - p)^{n-2} pe^t pe^t + n(pe^t + 1 - p)^{n-1} pe^t$$


and

$$E(X^2) = M_X''(0) = n(n - 1)p^2 + np$$


Therefore,

$$\sigma^2 = V(X) = E(X^2) - [E(X)]^2$$
$$= n(n - 1)p^2 + np - n^2p^2 = np - np^2 = np(1 - p)$$


in accord with the foregoing proposition. 
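
As a numerical check on the binomial theorem step, the defining sum and the closed form of the mgf can be compared at a few values of t. The Python sketch below uses n = 10 and p = .75 from Example 3.41; the function names and the particular t values are our own illustrative choices.

```python
import math


def mgf_from_definition(t, n, p):
    """E(e^{tX}) computed as the sum of e^{tx} * b(x; n, p) over x = 0, ..., n."""
    return sum(
        math.exp(t * x) * math.comb(n, x) * p ** x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )


def mgf_closed_form(t, n, p):
    """Closed form (p*e^t + 1 - p)^n obtained from the binomial theorem."""
    return (p * math.exp(t) + 1 - p) ** n


n, p = 10, 0.75
for t in (-1.0, 0.0, 0.5, 1.0):
    print(t, mgf_from_definition(t, n, p), mgf_closed_form(t, n, p))
```

Unlike the geometric case in Example 3.29, the sum here is finite, so the two expressions agree for every real t.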


Exercises: Section 3.5 (62-88) 


62. Determine whether each of the following rvs has a binomial distribution. If it does, identify the values of the parameters n and p (if possible).


a. X = the number of 4s in 10 rolls of a fair die

b. X = the number of multiple-choice questions a student gets right on a 40-question test, when each question has four choices and the student is completely guessing

c. X = the same as (b), but half the questions have four choices and the other half have three

d. X = the number of women in a random sample of 8 students, from a class comprised of 20 women and 15 men

e. X = the total weight of 15 randomly selected apples

f. X = the number of apples, out of a random sample of 15, that weigh more than 150 grams


63. Compute the following binomial probabilities directly from the formula for b(x; n, p):


a. b(3; 8, .6)

b. b(5; 8, .6)

c. P(3 ≤ X ≤ 5) when n = 8 and p = .6

d. P(1 ≤ X) when n = 12 and p = .1


64. Use Appendix Table A.1 or software to obtain the following probabilities:


a. B(4; 10, .3)

b. b(4; 10, .3)

c. b(6; 10, .7)

d. P(2 ≤ X ≤ 4) when X ~ Bin(10, .3)

e. P(2 ≤ X) when X ~ Bin(10, .3)

f. P(X ≤ 1) when X ~ Bin(10, .7)

g. P(2 < X < 6) when X ~ Bin(10, .3)


65. When circuit boards used in the manufacture of DVD players are tested, the long-run percentage of defectives is 5%. Let X = the number of defective boards in a random sample of size n = 25, so X ~ Bin(25, .05).


a. Determine P(X ≤ 2).

b. Determine P(X ≥ 5).

c. Determine P(1 ≤ X ≤ 4).

d. What is the probability that none of the 25 boards is defective?

e. Calculate the expected value and standard deviation of X.


66. A company that produces fine crystal knows from experience that 10% of its goblets have cosmetic flaws and must be classified as “seconds.”


a. Among six randomly selected goblets, 
how likely is it that only one is a second? 

b. Among six randomly selected goblets, 
what is the probability that at least two 
are seconds? 

c. If goblets are examined one by one, 
what is the probability that at most five 
must be selected to find four that are 
not seconds? 
67. Suppose that only 25% of all drivers come to a complete stop at an intersection having flashing red lights in all directions when no other cars are visible. What is the probability that, of 20 randomly chosen drivers coming to an intersection under these conditions,


a. At most 6 will come to a complete stop?

b. Exactly 6 will come to a complete stop?

c. At least 6 will come to a complete stop?


68. Refer to the previous exercise.


a. What is the expected number of drivers 
among the 20 that come to a complete 
stop? 

b. What is the standard deviation of the 
number of drivers among the 20 that 
come to a complete stop? 

c. What is the probability that the number 
of drivers among these 20 that come to 
a complete stop differs from the 
expected number by more than 2 
standard deviations? 


69. Exercise 29 (Section 3.3) gave the pmf of Y,
the number of traffic citations for a randomly 
selected individual insured by a company. 
What is the probability that among 15 ran- 
domly chosen such individuals 


a. At least 10 have no citations? 

b. Fewer than half have at least one 
citation? 

c. The number that have at least one citation is between 5 and 10, inclusive?²

² “Between a and b, inclusive” is equivalent to (a ≤ X ≤ b).


70. A particular type of tennis racket comes in a midsize version and an oversize version. Sixty percent of all customers at a store want the oversize version.


a. Among ten randomly selected customers who want this type of racket, what is the probability that at least six want the oversize version?

b. Among ten randomly selected customers, what is the probability that the number who want the oversize version is within 1 standard deviation of the mean value?

c. The store currently has seven rackets of each version. What is the probability that all of the next ten customers who want this racket can get the version they want from current stock?


71. Twenty percent of all telephones of a certain type are submitted for service while
under warranty. Of these, 60% can be 
repaired, whereas the other 40% must be 
replaced with new units. If a company 
purchases ten of these telephones, what is 
the probability that exactly two will end up 
being replaced under warranty? 


72. A March 29, 2019, Washington Post article
reported that (roughly) 5% of all students 
taking the ACT were granted extra time. 
Assume that 5% figure is exact, and con- 
sider a random sample of 25 students who 
have recently taken the ACT. 


a. What is the probability that exactly 1 
was granted extra time? 

b. What is the probability that at least 1 
was granted extra time? 

c. What is the probability that at least 2 
were granted extra time? 

d. What is the probability that the number 
among the 25 who were granted extra 
time is within 2 standard deviations of 
the number you would expect? 

e. Suppose that a student who does not 
receive extra time is allowed 3 h for the 
exam, whereas an accommodated stu- 
dent is allowed 4.5 h. What would you 
expect the average time allowed the 25 
selected students to be? 


73. Suppose that 90% of all batteries from a supplier have acceptable voltages. A certain type of flashlight requires two type-D batteries, and the flashlight will work only if both its batteries have acceptable voltages. Among ten randomly selected flashlights, what is the probability that at least nine will work? What assumptions did you make in the course of answering the question posed?


74. A k-out-of-n system functions provided that at least k of the n components function. Consider independently operating components, each of which functions (for the needed duration) with probability .96.


a. In a 3-component system, what is the 
probability that exactly two compo- 
nents function? 

b. What is the probability a 2-out-of-3 
system works? 

c. What is the probability a 3-out-of-5 
system works? 

d. What is the probability a 4-out-of-5 
system works? 

e. What does the component probability 
(previously .96) need to equal so that 
the 4-out-of-5 system will function 
with probability at least .9999? 


75. Bit transmission errors between computers sometimes occur, where one computer sends a 0 but the other computer receives a 1 (or vice versa). Because of this, the computer sending a message repeats each bit three times, so a 0 is sent as 000 and a 1 as 111. The receiving computer “decodes” each triplet by majority rule: whichever number, 0 or 1, appears more often in a triplet is declared to be the intended bit. For example, both 000 and 100 are decoded as 0, while 101 and 011 are decoded as 1. Suppose that 6% of bits are switched (0 to 1, or 1 to 0) during transmission between two particular computers, and that these errors occur independently during transmission.


a. Find the probability that a triplet is decoded incorrectly by the receiving computer.

b. Using your answer to part (a), explain 
how using triplets reduces communi- 
cation errors. 

c. How does your answer to part (a) change if each bit is repeated five times (instead of three)?

d. Imagine a 25 kilobit message (i.e., one 
requiring 25,000 bits to send). What is 
the expected number of errors if there 
is no bit repetition implemented? If 
each bit is repeated three times? 


76. A very large batch of components has
arrived at a distributor. The batch can be 
characterized as acceptable only if the 
proportion of defective components is at 
most .10. The distributor decides to ran- 
domly select 10 components and to accept 
the batch only if the number of defective 
components in the sample is at most 2. 


a. What is the probability that the batch 
will be accepted when the actual pro- 
portion of defectives is .01? .05? .10? 
.20? .25? 

b. Let p denote the actual proportion of defectives in the batch. A graph of P(batch is accepted) as a function of p, with p on the horizontal axis and P(batch is accepted) on the vertical axis, is called the operating characteristic curve for the acceptance sampling plan. Use the results of part (a) to sketch this curve for 0 ≤ p ≤ 1.

c. Repeat parts (a) and (b) with “1” 
replacing “2” in the acceptance sam- 
pling plan. 

d. Repeat parts (a) and (b) with “15” 
replacing “10” in the acceptance sam- 
pling plan. 

e. Which of the three sampling plans, that 
of part (a), (c), or (d), appears most 
satisfactory, and why? 


77. An ordinance requiring that a smoke
detector be installed in all previously constructed
houses has been in effect in a city
for 1 year. The fire department is concerned
that many houses remain without detectors. 
Let p = the true proportion of such houses 
having detectors, and suppose that a ran- 
dom sample of 25 homes is inspected. If the 
sample strongly indicates that fewer than 
80% of all houses have a detector, the fire 
department will campaign for a mandatory 
inspection program. Because of the costli- 
ness of the program, the department prefers 
not to call for such inspections unless 
sample evidence strongly argues for their 
necessity. Let X denote the number of homes with detectors among the 25 sampled. Consider rejecting the claim that p ≥ .8 if x ≤ 15.


a. What is the probability that the claim is rejected when the actual value of p is .8?

b. What is the probability of not rejecting 
the claim when p = .7? When p = .6? 

c. How do the “error probabilities” of 
parts (a) and (b) change if the value 15 
in the decision rule is replaced by 14? 


78. A toll bridge charges $1.00 for passenger
cars and $2.50 for other vehicles. Suppose 
that during daytime hours, 60% of all 
vehicles are passenger cars. If 25 vehicles 
cross the bridge during a particular daytime 
period, what is the resulting expected toll 
revenue? [Hint: Let X =the number of 
passenger cars; then the toll revenue h(X) is 
a linear function of X.] 


79. A student who is trying to write a paper for a course has a choice of two topics, A and B. If topic A is chosen, the student will order two books through interlibrary loan, whereas if topic B is chosen, the student will order four books. The student believes that a good paper necessitates receiving and using at least half the books ordered for either topic chosen. If the probability that a book ordered through interlibrary loan actually arrives in time is .9 and books arrive independently of one another, which topic should the student choose to maximize the probability of writing a good paper? What if the arrival probability is only .5 instead of .9?


80. Twelve jurors are randomly selected from a
large population. At least in theory, each 
juror arrives at a conclusion about the case 
before the jury independently of the other 
jurors. 


a. In a criminal case, all 12 jurors must agree on a verdict. Let p denote the probability that a randomly selected member of the population would reach a guilty verdict based on the evidence presented (so a proportion 1 − p would reach “not guilty”). What is the probability, in terms of p, that the jury reaches a unanimous verdict one way or the other?

b. For what values of p is the probability 
in part (a) the highest? For what value 
of p is the probability in (a) the lowest? 
Explain why this makes sense. 

c. In most civil cases, only a nine-person 
majority is required to decide a verdict. 
That is, if nine or more jurors favor the 
plaintiff, then the plaintiff wins; if at 
least nine jurors side with the defen- 
dant, then the defendant wins. Let 
p denote the probability that someone 
would side with the plaintiff based on 
the evidence. What is the probability, 
in terms of p, that the jury reaches a 
verdict one way or the other? How 
does this compare with your answer to 
part (a)? 

81. Customers at a gas station pay with a credit card (A), debit card (B), or cash (C). Assume that successive customers make independent choices, with P(A) = .5, P(B) = .2, and P(C) = .3.


a. Among the next 100 customers, what are the mean and variance of the number who pay with a debit card? Explain your reasoning.

b. Answer part (a) for the number among the 100 who don’t pay with cash.


82. An airport limousine can accommodate up
to four passengers on any one trip. The 
company will accept a maximum of six 
reservations for a trip, and a passenger must 
have a reservation. From previous records, 
20% of all those making reservations do not 
appear for the trip. In the following ques- 
tions, assume independence, but explain 
why there could be dependence. 


a. If six reservations are made, what is the 
probability that at least one individual 
with a reservation cannot be accom- 
modated on the trip? 

b. If six reservations are made, what is the 
expected number of available places 
when the limousine departs? 

c. Suppose the probability distribution of 
the number of reservations made is 
given in the accompanying table. 


Reservations    3     4     5     6
Probability     .1    .2    .3    .4


Let X denote the number of passengers 
on a randomly selected trip. Obtain the 
probability mass function of X. 


83. Let X be a binomial random variable with a specified value of n.


a. Are there values of p (0 ≤ p ≤ 1) for which V(X) = 0? Explain why this is so.

b. For what value of p is V(X) maximized? [Hint: Either graph V(X) as a function of p or else take a derivative.]


84. a. Verify the relationship b(x; n, 1 − p) = b(n − x; n, p).

b. Verify the relationship B(x; n, 1 − p) = 1 − B(n − x − 1; n, p). [Hint: At most x S’s is equivalent to at least (n − x) F’s.]

c. What do parts (a) and (b) imply about the necessity of including values of p > .5 in Appendix Table A.1?


85. Refer to Chebyshev’s inequality given in Exercise 45 (Section 3.3). Calculate P(|X − μ| ≥ kσ) for k = 2 and k = 3 when X ~ Bin(20, .5), and compare to the corresponding upper bounds. Repeat for X ~ Bin(20, .75).


86. Show that E(X) = np when X is a binomial random variable. [Hint: Express E(X) as a sum with lower limit x = 1. Then factor out np, let y = x − 1 so that the remaining sum is from y = 0 to y = n − 1, and show that the sum equals 1.]


87. At the end of this section we obtained the mean and variance of a binomial rv using the mgf. Obtain the mean and variance instead from R_X(t) = ln[M_X(t)].


88. Obtain the moment generating function of the number of failures, n − X, in a binomial experiment, and use it to determine the expected number of failures and the variance of the number of failures. Are the expected value and variance intuitively consistent with the expressions for E(X) and V(X)? Explain.


3.6 The Poisson Probability Distribution 


The binomial distribution was derived by starting with an experiment consisting of trials and applying 
the laws of probability to various outcomes of the experiment. There is no simple experiment on 
which the Poisson distribution is based, although we will shortly describe how it can be obtained from 
the binomial distribution by certain limiting operations. 


DEFINITION A random variable X is said to have a Poisson distribution with parameter μ (μ > 0) if the pmf of X is

$$p(x; \mu) = \frac{e^{-\mu}\mu^x}{x!} \qquad x = 0, 1, 2, \ldots$$

We shall see shortly that μ is in fact the expected value of X, so the notation
here is consistent with our previous use of the symbol μ. Because μ must be
positive, p(x; μ) > 0 for all possible x values. The fact that $\sum_{x=0}^{\infty} p(x; \mu) = 1$
is a consequence of the Taylor series expansion of $e^{\mu}$, which appears in most
calculus texts:

$$e^{\mu} = 1 + \mu + \frac{\mu^2}{2!} + \frac{\mu^3}{3!} + \cdots = \sum_{x=0}^{\infty} \frac{\mu^x}{x!} \qquad (3.17)$$

If the two extreme terms in Expression (3.17) are multiplied by $e^{-\mu}$ and then $e^{-\mu}$
is placed inside the summation, the result is

$$1 = \sum_{x=0}^{\infty} \frac{e^{-\mu}\mu^x}{x!}$$

which shows that p(x; μ) fulfills the second condition necessary for specifying
a pmf.




Example 3.42 The article “Detecting Clostridium difficile Outbreaks With Ward-Specific Cut-Off
Levels Based on the Poisson Distribution” (Infect. Control Hosp. Epidemiol. 2019: 265-266) recommends
using a Poisson model for X = the number of sporadic C. difficile infections (CDIs) in a
month in a given hospital ward, as a way to determine when an “outbreak” (that is, an unusually large
number of CDIs) has occurred. The article considers several values for μ for different wards in a
particular hospital. For a ward in which μ = 3 CDIs per month, the probability of observing exactly 5
CDIs in a particular month is

$$P(X = 5) = \frac{e^{-3}3^5}{5!} = .1008$$

and the chance of observing at least 5 CDIs is

$$P(X \ge 5) = 1 - P(X < 5) = 1 - \sum_{x=0}^{4} \frac{e^{-3}3^x}{x!} = 1 - e^{-3}\left[1 + 3 + \frac{3^2}{2!} + \frac{3^3}{3!} + \frac{3^4}{4!}\right] = .1847$$

These probabilities might not be so low as to convince hospital supervisors that they have an outbreak
on their hands. On the other hand, in a ward with a historic mean of μ = 1 CDI per month, the
probabilities are P(X = 5) = .0031 and P(X ≥ 5) = .0037, suggesting that five (or more) CDIs in one
month would be extremely unusual and should be considered a C. difficile outbreak. ■
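The probabilities in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library. The helper names poisson_pmf and poisson_cdf are illustrative choices, not functions from any package named in the text.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """p(x; mu) = e^{-mu} mu^x / x!"""
    return math.exp(-mu) * mu**x / math.factorial(x)

def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x), obtained by summing the pmf from 0 through x."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

for mu in (3, 1):
    p_exactly_5 = poisson_pmf(5, mu)
    p_at_least_5 = 1 - poisson_cdf(4, mu)   # complement of "at most 4"
    print(f"mu = {mu}: P(X = 5) = {p_exactly_5:.4f}, "
          f"P(X >= 5) = {p_at_least_5:.4f}")
```

Rounded to four decimal places, the results should match the values quoted in the example for both wards.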


The Poisson Distribution as a Limit 
The rationale for using the Poisson distribution in many situations is provided by the following 
proposition. 


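PROPOSITION Suppose that in the binomial pmf b(x; n, p) we let n → ∞ and p → 0 in
such a way that np approaches a value μ > 0. Then b(x; n, p) → p(x; μ).


Proof Begin with the binomial pmf:

$$b(x; n, p) = \binom{n}{x}p^x(1-p)^{n-x} = \frac{n(n-1)\cdots(n-x+1)}{x!}\,p^x(1-p)^{n-x} = \frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-x+1}{n}\cdot\frac{(np)^x}{x!}\cdot\frac{(1-p)^n}{(1-p)^x}$$

Taking the limit as n → ∞ and p → 0 with np → μ,

$$\lim_{n\to\infty} b(x; n, p) = 1\cdot 1\cdots 1\cdot\frac{\mu^x}{x!}\cdot\frac{\lim_{n\to\infty}(1 - np/n)^n}{1}$$

The limit on the right can be obtained from the calculus theorem that says the limit of $(1 - a_n/n)^n$ is
$e^{-a}$ if $a_n \to a$. Because np → μ,

$$\lim_{n\to\infty} b(x; n, p) = \frac{\mu^x}{x!}\cdot\lim_{n\to\infty}\left(1 - \frac{np}{n}\right)^n = \frac{\mu^x e^{-\mu}}{x!} = p(x; \mu) \qquad ■$$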

According to the proposition, in any binomial experiment for which the number of trials n is large
and the success probability p is small, b(x; n, p) ≈ p(x; μ) where μ = np. It is interesting to note that
Siméon Poisson discovered this eponymous distribution by this approach in the 1830s.

Table 3.3 shows the Poisson distribution for μ = 3 along with three binomial distributions with
np = 3, and Figure 3.9 (from R) plots the Poisson along with the first two binomial distributions. The
approximation is of limited use for n = 30, but of course the accuracy is better for n = 100 and much
better for n = 300.


Table 3.3 Comparing the Poisson and three binomial distributions 


x     n = 30, p = .1    n = 100, p = .03    n = 300, p = .01    Poisson, μ = 3
0     0.042391          0.047553            0.049041            0.049787
1     0.141304          0.147070            0.148609            0.149361
2     0.227656          0.225153            0.224414            0.224042
3     0.236088          0.227474            0.225170            0.224042
4     0.177066          0.170606            0.168877            0.168031
5     0.102305          0.101308            0.100985            0.100819
6     0.047363          0.049610            0.050153            0.050409
7     0.018043          0.020604            0.021277            0.021604
8     0.005764          0.007408            0.007871            0.008102
9     0.001565          0.002342            0.002580            0.002701
10    0.000365          0.000659            0.000758            0.000810
Figure 3.9 Comparing a Poisson and two binomial distributions [plot of p(x) against x for Bin(30, .1), Bin(100, .03), and Poisson(3)]
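The comparison in Table 3.3 can be regenerated with a short Python sketch. It uses only the standard library. The function names and the max_gap summary (the largest absolute pmf difference over x = 0, ..., 10) are additions for illustration and do not appear in the text.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """b(x; n, p)"""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x: int, mu: float) -> float:
    """p(x; mu)"""
    return math.exp(-mu) * mu**x / math.factorial(x)

mu = 3
settings = [(30, 0.1), (100, 0.03), (300, 0.01)]   # each has np = 3

# One row per x: the three binomial pmfs followed by the Poisson pmf.
for x in range(11):
    row = [binomial_pmf(x, n, p) for n, p in settings] + [poisson_pmf(x, mu)]
    print(f"{x:>2}  " + "  ".join(f"{v:.6f}" for v in row))

# Largest absolute gap between each binomial pmf and the Poisson pmf.
max_gap = [
    max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu)) for x in range(11))
    for n, p in settings
]
print(max_gap)
```

The gaps shrink as n grows with np held at 3, which is the pattern the proposition predicts.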
Example 3.43
Suppose you have a 4-megabit modem (4,000,000 bits/s) with bit error probability $10^{-8}$. Assume bit
errors occur independently, and assume your bit rate stays constant at 4 Mbps. What is the probability
of exactly 3 bit errors in the next minute? Of at most 3 bit errors in the next minute?

Define a random variable X = the number of bit errors in the next minute. From the description,
X satisfies the conditions of a binomial distribution; specifically, since a constant bit rate of 4 Mbps
equates to 240,000,000 bits transmitted per minute, X ~ Bin(240,000,000, $10^{-8}$). Hence, the
probability of exactly three bit errors in the next minute is

$$P(X = 3) = b(3; 240{,}000{,}000, 10^{-8}) = \binom{240{,}000{,}000}{3}(10^{-8})^3(1 - 10^{-8})^{239{,}999{,}997}$$

For a variety of reasons, some calculators will struggle with this computation. The expression for the
chance of at most 3 bit errors, P(X ≤ 3), is even worse. (The inability to compute such expressions
in the nineteenth century, even with modest values of n and p, was Poisson’s motive to derive an
easily computed approximation.)
We may approximate these probabilities using the Poisson distribution. The parameter μ is given
by μ = np = 240,000,000($10^{-8}$) = 2.4, whence

$$P(X = 3) \approx p(3; 2.4) = \frac{e^{-2.4}(2.4)^3}{3!} = .20901416$$

Similarly, the probability of at most 3 bit errors in the next minute is approximated by

$$P(X \le 3) \approx \sum_{x=0}^{3} p(x; 2.4) = \sum_{x=0}^{3} \frac{e^{-2.4}(2.4)^x}{x!} = .77872291$$

Using software, the exact probabilities (i.e., using the binomial model) are .2090141655 and
.7787229106, respectively. The Poisson approximations agree to eight decimal places and are clearly
more computationally tractable. ■
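For readers who want to check both sets of numbers, here is a minimal Python sketch. It evaluates the binomial term (1 − p)^(n − x) through math.log1p to limit rounding error. That is one reasonable approach, not necessarily how the exact values quoted in the example were obtained. The function names are illustrative.

```python
import math

n, p = 240_000_000, 1e-8
mu = n * p   # approximately 2.4

def binomial_pmf_stable(x: int, n: int, p: float) -> float:
    """b(x; n, p), with (1 - p)^(n - x) computed as exp((n - x) * log1p(-p))."""
    return math.comb(n, x) * p**x * math.exp((n - x) * math.log1p(-p))

def poisson_pmf(x: int, mu: float) -> float:
    """p(x; mu)"""
    return math.exp(-mu) * mu**x / math.factorial(x)

exact_3 = binomial_pmf_stable(3, n, p)
approx_3 = poisson_pmf(3, mu)
exact_le_3 = sum(binomial_pmf_stable(x, n, p) for x in range(4))
approx_le_3 = sum(poisson_pmf(x, mu) for x in range(4))

print(f"P(X = 3):  binomial {exact_3:.10f}, Poisson {approx_3:.10f}")
print(f"P(X <= 3): binomial {exact_le_3:.10f}, Poisson {approx_le_3:.10f}")
```

Python's arbitrary-precision integers make math.comb(240_000_000, 3) exact, so the only floating-point approximations are in the powers and the exponential.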


Many software packages will compute both p(x; μ) and the corresponding cdf P(x; μ) for specified
values of x and μ upon request; the relevant R functions appear in Table 3.4. Appendix Table A.2
exhibits the cdf P(x; μ) for μ = .1, .2, ..., 1, 2, ..., 10, 15, and 20. For example, if μ = 2, then
P(X ≤ 3) = P(3; 2) = .857, whereas P(X = 3) = P(3; 2) − P(2; 2) = .180.


Table 3.4 Poisson probability calculations in R 


Function:    pmf            cdf
Notation:    p(x; μ)        P(x; μ)
R:           dpois(x, μ)    ppois(x, μ)


The Mean, Variance, and MGF of a Poisson Random Variable 
Since b(x; n, p) → p(x; μ) as n → ∞, p → 0, np → μ, one might guess that the mean and variance of
a binomial variable approach those of a Poisson variable. These limits are, respectively, np → μ and
np(1 − p) → μ.


PROPOSITION If X has a Poisson distribution with parameter μ, then E(X) = V(X) = μ.


These results can also be derived directly from the definitions of mean and variance (see Exercise 104 
for the mean). 


Example 3.44 (Example 3.42 continued) For the hospital ward with μ = 3, the expected number of
CDIs in a month is 3 (obviously), and the standard deviation of the number of monthly CDIs is
$\sigma_X = \sqrt{\mu} = \sqrt{3} = 1.73$. So, observing 2–4 CDIs in a month would not be unusual (those values are
within one sd of the mean), but a month with 7 CDIs on the ward would be alarming (since that’s
more than two standard deviations above average). ■


The moment generating function of the Poisson distribution is easy to derive, and it gives a direct 
route to the mean and variance (Exercise 106). 


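PROPOSITION The Poisson moment generating function is

$$M_X(t) = e^{\mu(e^t - 1)}$$

Proof The mgf is by definition

$$M_X(t) = E(e^{tX}) = \sum_{x=0}^{\infty} e^{tx}\,\frac{e^{-\mu}\mu^x}{x!} = e^{-\mu}\sum_{x=0}^{\infty} \frac{(\mu e^t)^x}{x!} = e^{-\mu}e^{\mu e^t} = e^{\mu(e^t - 1)}$$

This uses the series expansion $\sum_{x=0}^{\infty} u^x/x! = e^u$. ■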
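A quick numerical check of the proposition is sketched below in Python. It compares a truncated version of the defining series with the closed form, then approximates the first two derivatives at t = 0 by central differences. The value μ = 3, the truncation at 100 terms, and the step size h are arbitrary illustrative choices.

```python
import math

def poisson_mgf_series(t: float, mu: float, terms: int = 100) -> float:
    """Partial sum of e^{tx} e^{-mu} mu^x / x! over x = 0, ..., terms - 1."""
    total = 0.0
    term = math.exp(-mu)   # the x = 0 term
    for x in range(terms):
        total += term
        term *= mu * math.exp(t) / (x + 1)   # ratio of successive terms
    return total

def poisson_mgf_closed(t: float, mu: float) -> float:
    """The closed form exp(mu (e^t - 1))."""
    return math.exp(mu * (math.exp(t) - 1))

mu, h = 3.0, 1e-4

# Central-difference approximations to M'(0) and M''(0).
m1 = (poisson_mgf_closed(h, mu) - poisson_mgf_closed(-h, mu)) / (2 * h)
m2 = (poisson_mgf_closed(h, mu) - 2 * poisson_mgf_closed(0.0, mu)
      + poisson_mgf_closed(-h, mu)) / h**2

mean = m1               # should be close to mu
variance = m2 - m1**2   # should also be close to mu
print(mean, variance)
```

Both finite-difference values come out close to μ, consistent with E(X) = V(X) = μ.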


The Poisson Process 

A very important application of the Poisson distribution arises in connection with the occurrence of 
events over time. As an example, suppose that starting from a time point that we label t = 0, we are 
interested in counting the number of radioactive pulses recorded by a Geiger counter. We make the 
following assumptions about the way in which pulses occur: 


1. There exists a parameter λ > 0 such that for any short time interval of length Δt, the probability
that exactly one pulse is received is λ · Δt + o(Δt).⁴

2. The probability of more than one pulse being received during Δt is o(Δt). [This, along with
Assumption 1, implies that the probability of no pulses during Δt is 1 − λ · Δt − o(Δt).]

3. The number of pulses received during the time interval Δt is independent of the number
received prior to this time interval.

Informally, Assumption 1 says that for a short interval of time, the probability of receiving a single
pulse is approximately proportional to the length of the time interval, where λ is the constant of
proportionality. Now let $P_k(t)$ denote the probability that exactly k pulses will be received by the
counter during any particular time interval of length t.


PROPOSITION $P_k(t) = e^{-\lambda t}(\lambda t)^k/k!$, so that the number of pulses during a time interval
of length t is a Poisson rv with parameter μ = λt. The expected number
of pulses during any such time interval is then λt, so the expected number
during a unit interval of time is λ.


⁴ A quantity is o(Δt) (read “little o of delta t”) if, as Δt approaches 0, so does o(Δt)/Δt. That is, o(Δt) is even more
negligible than Δt itself. The quantity (Δt)² has this property, but sin(Δt) does not.
Example 3.43 hints at why this might be reasonable: if we “digitize” time (that is, divide time into
discrete pieces, such as transmitted bits) and look at the number of the resulting time pieces that
include an event, a binomial model is often applicable. If the number of time pieces is very large and 
the success probability close to zero, which would occur if we divided a fixed time frame into ever- 
smaller pieces, then we may invoke the Poisson approximation from earlier in this section. See 
Exercise 105 for a derivation. 


Example 3.45 Suppose pulses arrive at the Geiger counter at an average rate of six per minute, so
that λ = 6. To find the probability that in a 30-second interval at least one pulse is received, note that
the number of pulses in such an interval has a Poisson distribution with parameter λt = 6(.5) = 3
(.5 min is used because λ is expressed as a rate per minute). Then with X = the number of pulses
received in the 30-second interval,

$$P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{e^{-3}3^0}{0!} = .950$$

In a one-hour interval (t = 60), the expected number of pulses is μ = λt = 6(60) = 360, with a
standard deviation of $\sigma = \sqrt{\mu} = \sqrt{360} = 18.97$. According to this model, in a typical hour we will
observe 360 ± 19 pulses arrive at the Geiger counter. ■
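The calculations in this example follow a fixed pattern: convert the interval length to the time unit of the rate, multiply to get μ = λt, and then use the Poisson pmf. A minimal Python sketch of that pattern follows. The variable and function names are illustrative.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """p(x; mu)"""
    return math.exp(-mu) * mu**x / math.factorial(x)

rate = 6.0   # pulses per minute (lambda in the text)

# 30-second interval: t = 0.5 min because the rate is per minute.
mu_30s = rate * 0.5
p_at_least_one = 1 - poisson_pmf(0, mu_30s)

# One-hour interval: t = 60 min.
mu_hour = rate * 60
sd_hour = math.sqrt(mu_hour)

print(f"P(at least one pulse in 30 s) = {p_at_least_one:.3f}")
print(f"One hour: mean = {mu_hour:.0f}, sd = {sd_hour:.2f}")
```

Keeping the rate and the interval length in the same time unit is the step most easily overlooked. Here both are expressed in minutes.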


If in Assumptions 1–3 we replace “pulse” by “event,” then the number of events occurring during
a fixed time interval of length t has a Poisson distribution with parameter λt. Any process that has this
distribution is called a Poisson process, and λ is called the rate of the process. Other examples of
situations giving rise to a Poisson process include monitoring the status of a computer system over 
time, with breakdowns constituting the events of interest; recording the number of accidents in an 
industrial facility over time; logging hits to a website; and observing the number of cosmic-ray 
showers from an observatory over time. 

Instead of observing events over time, consider observing events of some type that occur in a two- 
or three-dimensional region. For example, we might select on a map a certain region R of a forest, go 
to that region, and count the number of trees. Each tree would represent an event occurring at a 
particular point in space. Under assumptions similar to 1–3, it can be shown that the number of events
occurring in a region R has a Poisson distribution with parameter λ · a(R), where a(R) is the area or
volume of R. The quantity λ is the expected number of events per unit area or volume.


Exercises: Section 3.6 (89-107) 


89. Let X, the number of flaws on the surface of a randomly selected carpet of a particular type, have a Poisson distribution with parameter μ = 5. Use software or Appendix Table A.2 to compute the following probabilities:


a. P(X ≤ 8)

b. P(X = 8)

c. P(9 ≤ X)

d. P(5 ≤ X ≤ 8)

e. P(5 < X < 8)


90. Suppose the number X of tornadoes observed in a particular region during a 1-year period has a Poisson distribution with μ = 8.


a. Compute P(X ≤ 5).

b. Compute P(6 ≤ X ≤ 9).

c. Compute P(10 ≤ X).

d. What is the probability that the observed number of tornadoes exceeds the expected number by more than 1 standard deviation?
91. Suppose that the number of drivers who travel between a particular origin and destination during a designated time period has a Poisson distribution with parameter μ = 20 (suggested in the article “Dynamic Ride Sharing: Theory and Practice,” J. Transp. Engr. 1997: 308-312). What is the probability that the number of drivers will


a. Be at most 10?

b. Exceed 20?

c. Be between 10 and 20, inclusive? Be strictly between 10 and 20?

d. Be within 2 standard deviations of the mean value?


92. Consider writing onto a computer disk and then sending it through a certifier that counts the number of missing pulses. Suppose this number X has a Poisson distribution with parameter μ = .2. (Suggested in “Average Sample Number for Semi-Curtailed Sampling Using the Poisson Distribution,” J. Qual. Tech. 1983: 126-129.)


a. What is the probability that a disk has 
exactly one missing pulse? 

b. What is the probability that a disk has 
at least two missing pulses? 

c. If two disks are independently selected, 
what is the probability that neither 
contains a missing pulse? 


93. The article “Metal Hips Fail Faster, Raise
Other Health Concerns” on the www.arthritis.com
website reported that the
five-year failure rate of metal-on-plastic 
implants was 1.7% (rates for metal-on- 
metal and ceramic implants were signifi- 
cantly higher). Use both a binomial calcu- 
lation and a Poisson approximation to 
answer each of the following. 


a. Among 200 randomly selected such 
implants, what is the probability that 
exactly three will fail? 

b. Among 200 randomly selected such 
implants, what is the probability that at 
most three will fail? 


94. Suppose that only .10% of all computers of a certain type experience CPU failure during the warranty period. Consider a sample of 10,000 computers.


a. What are the expected value and stan- 
dard deviation of the number of com- 
puters in the sample that have the 
defect? 

b. What is the (approximate) probability 
that more than 10 sampled computers 
have the defect? 

c. What is the (approximate) probability 
that no sampled computers have the 
defect? 


95. If a publisher of nontechnical books takes
great pains to ensure that its books are free 
of typographical errors, so that the proba- 
bility of any given page containing at least 
one such error is .005 and errors are inde- 
pendent from page to page, what is the 
probability that one of its 400-page novels 
will contain exactly one page with errors? 
At most three pages with errors? 


96. In proof testing of circuit boards, the
probability that any particular diode will 
fail is .01. Suppose a circuit board contains 
200 diodes. 


a. How many diodes would you expect to 
fail, and what is the standard deviation 
of the number that are expected to fail? 

b. What is the (approximate) probability 
that at least four diodes will fail on a 
randomly selected board? 

c. If five boards are shipped to a particular
customer, how likely is it that at least 
four of them will work properly? (A 
board works properly only if all its 
diodes work.) 


97. Suppose small aircraft arrive at an airport according to a Poisson process with rate λ = 8 per hour, so that the number of arrivals during a time period of t hours is a Poisson rv with parameter μ = 8t.


a. What is the probability that exactly 6 
small aircraft arrive during a 1-h per- 
iod? At least 6? At least 10? 


b. What are the expected value and standard deviation of the number of small aircraft that arrive during a 90-min period?

c. What is the probability that at least 20 small aircraft arrive during a 2.5-h period? That at most 10 arrive during this period?


98. The number of people arriving for treatment at an emergency room can be modeled by a Poisson process with a rate parameter of 5 per hour.


a. What is the probability that exactly four arrivals occur during a particular hour?

b. What is the probability that at least four people arrive during a particular hour?

c. How many people do you expect to arrive during a 45-min period?


99. The number of requests for assistance received by a towing service is a Poisson process with rate λ = 4 per hour.


a. Compute the probability that exactly 
ten requests are received during a par- 
ticular 2-h period. 

b. If the operators of the towing service 
take a 30-min break for lunch, what is 
the probability that they do not miss 
any calls for assistance? 

c. How many calls would you expect 
during their break? 


The article “Expectation Analysis of the 
Probability of Failure for Water Supply 
Pipes” (J. Pipeline Syst. Engr. Pract. 2012: 
36-46) recommends using a Poisson pro- 
cess to model the number of failures in 
commercial water pipes. The article also 
gives estimates of the failure rate λ, in units
of failures per 100 miles of pipe per day, for 
four different types of pipe and for many 
different years. 


a. For PVC pipe in 2008, the authors 
estimate a failure rate of 0.0081 failures 


101. 


102. 



per 100 miles of pipe per day. Consider 
a 100-mile-long segment of such pipe. 
What is the expected number of failures 
in one year (365 days)? Based on this 
expectation, what is the probability of 
at least one failure along such a pipe in 
one year? 

b. For cast iron pipe in 2005, the authors’ 
estimate is λ = 0.0864 failures per 100
miles per day. Suppose a town had 
1500 miles of cast iron pipe under- 
ground in 2005. What is the probability 
of at least one failure somewhere along 
this pipe system on any given day? 

The article “Reliability-Based Service-Life 
Assessment of Aging Concrete Structures” 
(J. Struct. Engr. 1993: 1600-1621) sug- 
gests that a Poisson process can be used to 
represent the occurrence of structural loads 
over time. Suppose the mean time between 
occurrences of loads (which can be shown 
to be 1/λ) is .5 year.


a. How many loads can be expected to 
occur during a 2-year period? 

b. What is the probability that more than 
five loads occur during a 2-year 
period? 

c. How long must a time period be so that 
the probability of no loads occurring 
during that period is at most .1? 


Automobiles arrive at a vehicle equipment 
inspection station according to a Poisson 
process with rate λ = 10 per hour. Suppose
that with probability .5 an arriving vehicle 
will have no equipment violations. 


a. What is the probability that exactly ten 
arrive during the hour and all ten have 
no violations? 

b. For any fixed y ≥ 10, what is the
probability that y arrive during the 
hour, of which ten have no violations? 

c. What is the probability that ten “no- 
violation” cars arrive during the next 



103. 


104. 


105. 


3 Discrete Random Variables and Probability Distributions 


hour? [Hint: Sum the probabilities in 
part (b) from y = 10 to ∞.]


Suppose that trees are distributed in a forest 
according to a two-dimensional Poisson 
process with parameter λ, the expected
number of trees per acre, equal to 80. 


a. What is the probability that in a certain 
quarter-acre plot, there will be at most 
16 trees? 

b. If the forest covers 85,000 acres, what 
is the expected number of trees in the 
forest? 

c. Suppose you select a point in the forest 
and construct a circle of radius .1 mile.
Let X = the number of trees within that 
circular region. What is the pmf of X? 
[Hint: 1 sq mile = 640 acres.] 


Let X have a Poisson distribution with 
parameter μ. Show that E(X) = μ directly
from the definition of expected value.
[Hint: The first term in the sum equals 0,
and then x can be canceled. Now factor out
μ and show that what is left sums to 1.]


a. In a Poisson process, what has to happen
in both the time interval (0, t) and
the interval (t, t + Δt) so that no events
occur in the entire interval (0, t + Δt)?
Use this and Assumptions 1-3 to write
a relationship between $P_0(t + \Delta t)$ and
$P_0(t)$.

b. Use the result of part (a) to write an
expression for the difference $P_0(t + \Delta t) - P_0(t)$.
Then divide by Δt and let
Δt → 0 to obtain an equation
involving $(d/dt)P_0(t)$, the derivative of
$P_0(t)$ with respect to t.

c. Verify that $P_0(t) = e^{-\lambda t}$ satisfies the
equation of part (b).

d. It can be shown in a manner similar to
parts (a) and (b) that the $P_k(t)$’s must
satisfy the system of differential
equations

\[ \frac{d}{dt} P_k(t) = \lambda P_{k-1}(t) - \lambda P_k(t), \qquad k = 1, 2, 3, \ldots \]

Verify that $P_k(t) = e^{-\lambda t}(\lambda t)^k / k!$ satisfies
the system. (This is actually the only
solution.)


a. Use derivatives of the moment gener- 
ating function to obtain the mean and 
variance for the Poisson distribution. 

b. As discussed in Section 3.4, obtain the 
Poisson mean and variance from 
$R_X(t) = \ln[M_X(t)]$. In terms of effort,
how does this method compare with 
the one in part (a)? 


Show that the binomial moment generating 
function converges to the Poisson moment 
generating function if we let n → ∞ and
p → 0 in such a way that np approaches a
value μ > 0. [Hint: Use the calculus theorem
that was used in showing that the
binomial probabilities converge to the
Poisson probabilities.] There is in fact a
theorem saying that convergence of the mgf
implies convergence of the probability
distribution. In particular, convergence of
the binomial mgf to the Poisson mgf
implies b(x; n, p) → p(x; μ).


This section introduces discrete distributions that are closely related to the binomial distribution. 
Whereas the binomial distribution is the approximate probability model for sampling without 
replacement from a finite dichotomous (S/F) population, the hypergeometric distribution is the exact 
probability model for the number of S’s in the sample. The binomial rv X is the number of S’s when 
the number n of trials is fixed, whereas the negative binomial distribution arises from fixing the 
number of S’s desired and letting the number of trials be random. 
The Hypergeometric Distribution
The assumptions leading to the hypergeometric distribution are as follows: 


1. The population or set to be sampled consists of N individuals, objects, or elements (a finite 
population). 

2. Each individual can be characterized as a success (S) or a failure (F), and there are M successes 
in the population. 

3. A sample of n individuals is selected without replacement in such a way that each subset of size 
n is equally likely to be chosen. 

The random variable of interest is X = the number of S’s in the sample. The probability distribution
of X depends on the parameters n, M, and N, so we wish to obtain P(X = x) = h(x; n, M, N).


Example 3.46 During a particular period, a university’s information technology office received 20 
service orders for problems with laptops, of which 8 were Macs and 12 were PCs. A sample of 5 of 
these service orders is to be selected for inclusion in a customer satisfaction survey. Suppose that the 
5 are selected in a completely random fashion, so that any particular subset of size 5 has the same 
chance of being selected as does any other subset (think of putting the numbers 1, 2, ..., 20 on 20 
identical slips of paper, mixing up the slips, and choosing 5 of them). What then is the probability that 
exactly 2 of the selected service orders were for PC laptops? 

In this example, the population size is N = 20, the sample size is n = 5, and the number of S’s 
(PC = S) and F’s (Mac = F) in the population are M = 12 and N − M = 8, respectively. Let X = the
number of PCs among the 5 sampled service orders. Because all outcomes (each consisting of 5 
particular orders) are equally likely, 


\[ P(X = 2) = h(2; 5, 12, 20) = \frac{\text{number of outcomes having } X = 2}{\text{number of possible outcomes}} \]

The number of possible outcomes in the experiment is the number of ways of selecting 5 from the 20
objects without regard to order, that is, $\binom{20}{5}$. To count the number of outcomes having X = 2, note
that there are $\binom{12}{2}$ ways of selecting 2 of the PC orders, and for each such way there are $\binom{8}{3}$ ways
of selecting the 3 Mac orders to fill out the sample. The Fundamental Counting Principle from
Section 2.3 then gives $\binom{12}{2} \cdot \binom{8}{3}$ as the number of outcomes with X = 2, so

\[ h(2; 5, 12, 20) = \frac{\binom{12}{2}\binom{8}{3}}{\binom{20}{5}} = \frac{(66)(56)}{15{,}504} = \frac{77}{323} = .238 \] ■
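The counting argument of Example 3.46 can be checked numerically. The following is a minimal R sketch that uses only base R's choose(); the variable names are our own and purely illustrative.

```r
# Example 3.46: N = 20 service orders, M = 12 PCs (successes), sample of n = 5
N <- 20; M <- 12; n <- 5; x <- 2

favorable <- choose(M, x) * choose(N - M, n - x)  # pick 2 PC orders and 3 Mac orders
possible  <- choose(N, n)                          # pick any 5 of the 20 orders

prob <- favorable / possible
print(c(favorable = favorable, possible = possible, prob = round(prob, 3)))
```

The ratio agrees with the hand calculation because every subset of size 5 is equally likely, so the probability is a ratio of counts.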


In general, if the sample size n is smaller than the number of successes in the population (M), then the 
largest possible X value is n. However, if M < n (e.g., a sample size of 25 and only 15 successes in the 
population), then X can be at most M. Similarly, whenever the number of population failures 
(N − M) exceeds the sample size, the smallest possible X value is 0 (since all sampled individuals
might then be failures). However, if N − M < n, the smallest possible X value is n − (N − M). Thus,
the possible values of X satisfy the restriction max(0, n − N + M) ≤ x ≤ min(n, M). An argument
parallel to that of the previous example gives the pmf of X. 



PROPOSITION If X is the number of S’s in a random sample of size n drawn
from a population consisting of M S’s and (N − M) F’s, then
the probability distribution of X, called the hypergeometric
distribution, is given by

\[ P(X = x) = h(x; n, M, N) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} \]

for x an integer satisfying max(0, n − N + M) ≤ x ≤ min(n, M).²


In Example 3.46, n = 5, M = 12, and N = 20, so h(x; 5, 12, 20) for x = 0, 1, 2, 3, 4, 5 can be obtained
by substituting these numbers into Equation (3.19).
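As a sketch of how the whole pmf might be tabulated, the R code below evaluates the counting formula over the range max(0, n − N + M) ≤ x ≤ min(n, M). The helper name h is hypothetical, chosen only to mirror the notation h(x; n, M, N).

```r
# Hypothetical helper: hypergeometric pmf from the counting formula
h <- function(x, n, M, N) choose(M, x) * choose(N - M, n - x) / choose(N, n)

n <- 5; M <- 12; N <- 20                     # Example 3.46
support <- max(0, n - N + M):min(n, M)       # possible values of X
pmf <- h(support, n, M, N)

print(round(setNames(pmf, support), 4))
print(sum(pmf))                              # probabilities over the support should total 1
```

Because choose(a, b) returns 0 in R when b exceeds a, the same helper can also be evaluated at every integer from 0 to n, in the spirit of the footnote to the proposition.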


Example 3.47 Capture-recapture. Five individuals from an animal population thought to be near
extinction in a region have been caught, tagged, and released to mix into the population. After they
have had an opportunity to mix, a random sample of ten of these animals is selected. Let X = the
number of tagged animals in the second sample. If there are actually 25 animals of this type in the
region, what is the probability that (a) X = 2? (b) X ≤ 2?

Application of the hypergeometric distribution here requires assuming that every subset of 10 
animals has the same chance of being captured. This in turn implies that released animals are no 
easier or harder to catch than are those not initially captured. Then the parameter values are n = 10, 
M =5 (5 tagged animals in the population), and N = 25, so 


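\[ h(x; 10, 5, 25) = \frac{\binom{5}{x}\binom{20}{10-x}}{\binom{25}{10}}, \qquad x = 0, 1, 2, 3, 4, 5 \]

For part (a),

\[ P(X = 2) = h(2; 10, 5, 25) = \frac{\binom{5}{2}\binom{20}{8}}{\binom{25}{10}} = .385 \]

For part (b),

\[ P(X \le 2) = P(X = 0, 1, \text{ or } 2) = \sum_{x=0}^{2} h(x; 10, 5, 25) = .057 + .257 + .385 = .699 \] ■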


R and other software packages will easily generate hypergeometric probabilities; see Table 3.5 at the
end of this section. Comprehensive tables of the hypergeometric distribution are available, but
because the distribution has three parameters, these tables require much more space than tables for the
binomial or Poisson distributions.

² If we define $\binom{a}{b} = 0$ for a < b, then h(x; n, M, N) may be applied for all integers 0 ≤ x ≤ n.


As in the binomial case, there are simple expressions for E(X) and V(X) for hypergeometric rvs. 


PROPOSITION The mean and variance of the hypergeometric rv X having pmf 
h(x; n, M, N) are 


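\[ E(X) = n \cdot \frac{M}{N} \qquad V(X) = \left(\frac{N-n}{N-1}\right) \cdot n \cdot \frac{M}{N} \cdot \left(1 - \frac{M}{N}\right) \]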


The proof will be given in Section 6.3. We do not give the moment generating function for the 
hypergeometric distribution, because the mgf is more trouble than it is worth here. 

The ratio M/N is the proportion of S’s in the population. Replacing M/N by p in E(X) and 
V(X) gives 


\[ E(X) = np \qquad V(X) = \left(\frac{N-n}{N-1}\right) \cdot np(1-p) \qquad (3.19) \]


Expression (3.19) shows that the means of the binomial and hypergeometric rvs are equal, whereas 
the variances of the two rvs differ by the factor (N − n)/(N − 1), often called the finite population
correction factor. This factor is < 1, so the hypergeometric variable has smaller variance than does 
the binomial rv. The correction factor can be written (1 — n/N)/(1 — 1/N), which is approximately 1 
when n is small relative to N. 


Example 3.48 (Example 3.47 continued) In the animal-tagging example, n = 10, M=5, and 
N = 25, so p = 5/25 = .2 and 


\[ V(X) = \frac{15}{24}(10)(.2)(.8) = (.625)(1.6) = 1 \]

If the sampling were carried out with replacement, V(X) = 1.6.

Suppose the population size N is not actually known, so the value x is observed and we wish to 
estimate N. It is reasonable to equate the observed sample proportion of S’s, x/n, with the population 
proportion, M/N, giving the estimate 
\[ \hat{N} = \frac{M \cdot n}{x} \]

If M = 100, n = 40, and x = 16, then $\hat{N}$ = 250. ■
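The numbers in Example 3.48 can be reproduced in R. The sketch below compares the formulas for E(X) and V(X) with the same quantities computed directly from the pmf using base R's dhyper(); the variable names are illustrative only.

```r
n <- 10; M <- 5; N <- 25                 # animal-tagging example
p <- M / N
fpc <- (N - n) / (N - 1)                 # finite population correction factor

mean_formula <- n * p
var_formula  <- fpc * n * p * (1 - p)

# Same quantities computed directly from the pmf
x  <- 0:min(n, M)
px <- dhyper(x, M, N - M, n)
mean_pmf <- sum(x * px)
var_pmf  <- sum((x - mean_pmf)^2 * px)

print(c(mean_formula, mean_pmf, var_formula, var_pmf))

# Estimate of an unknown population size when M = 100, n = 40, x = 16
N_hat <- 100 * 40 / 16
print(N_hat)
```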


Our rule in Section 3.5 stated that if sampling is without replacement but n/N is at most .05, then the 
binomial distribution can be used to compute approximate probabilities involving the number of S’s in 
the sample. A more precise statement is as follows: Let the population size, N, and number of 
population S’s, M, get large with the ratio M/N approaching p. Then h(x; n, M, N) approaches b(x; n, p); 
so for n/N small, the two are approximately equal provided that p is not too near either 0 or 1. This is 
the rationale for our rule. 
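To see this limiting behavior numerically, one might hold n and p = M/N fixed and let N grow. The following R sketch (the population sizes are arbitrary choices of ours) reports the largest absolute difference between the hypergeometric and binomial pmfs.

```r
n <- 10; p <- 0.2; x <- 0:n
Ns <- c(25, 250, 2500, 25000)            # arbitrary population sizes, with M/N = p throughout

max_diff <- sapply(Ns, function(N) {
  M <- round(p * N)                      # number of successes in the population
  max(abs(dhyper(x, M, N - M, n) - dbinom(x, n, p)))
})

print(data.frame(N = Ns, n_over_N = n / Ns, max_diff = signif(max_diff, 3)))
```

The discrepancy shrinks as n/N decreases, which is the behavior the n/N ≤ .05 rule relies on.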
The Negative Binomial and Geometric Distributions
The negative binomial distribution is based on an experiment satisfying the following conditions: 


1. The experiment consists of a sequence of independent trials. 

2. Each trial can result in either a success (S) or a failure (F). 

3. The probability of success is constant from trial to trial, so P(S on trial i) = p for i = 1, 2, 3, ....

4. The experiment continues (trials are performed) until a total of r successes have been observed,
where r is a specified positive integer.

The random variable of interest is X = the number of trials required to achieve the rth success, and 
X is called a negative binomial random variable. In contrast to the binomial rv, the number of 
successes is fixed and the number of trials is random. Possible values of X are r, r+ 1, r+ 2, ..., 
since it takes at least r trials to achieve r successes. 

Let nb(x; r, p) denote the pmf of X. The event {X = x} is equivalent to {r − 1 S’s in the first
(x − 1) trials and an S on the xth trial}; e.g., if r = 5 and x = 15, then there must be four S’s in the first
14 trials and trial 15 must be an S. Since trials are independent,

\[ nb(x; r, p) = P(X = x) = P(r - 1 \text{ S's on the first } x - 1 \text{ trials}) \cdot P(S) \qquad (3.20) \]

The first probability on the far right of Expression (3.20) is the binomial probability

\[ \binom{x-1}{r-1} p^{r-1}(1-p)^{(x-1)-(r-1)} \qquad \text{where } p = P(S) \]

Simplifying and then multiplying by the extra factor of p at the end of (3.20) yields the pmf.


PROPOSITION The pmf of the negative binomial rv X with parameters r = desired 
number of S’s and p = P(S) is 


\[ nb(x; r, p) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \qquad x = r, r+1, r+2, \ldots \]


Example 3.49 A pediatrician wishes to recruit 4 couples, each of whom is expecting their first child, 
to participate in a new natural childbirth regimen. Let p = P (a randomly selected couple agrees to 
participate). If p = .2, what is the probability that exactly 15 couples must be asked before 4 are found 
who agree to participate? Substituting r = 4, p = .2, and x = 15 into nb(x; r, p) gives

\[ nb(15; 4, .2) = \binom{15-1}{4-1} (.2)^4 (.8)^{11} = .050 \]

The probability that at most 15 couples need to be asked is

\[ P(X \le 15) = \sum_{x=4}^{15} nb(x; 4, .2) = \sum_{x=4}^{15} \binom{x-1}{3} (.2)^4 (.8)^{x-4} = .352 \] ■
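Both probabilities in Example 3.49 can be evaluated straight from the pmf. In the R sketch below, the helper nb is a hypothetical name of ours that simply codes the formula for nb(x; r, p), with X counted as the number of trials.

```r
# Hypothetical helper: negative binomial pmf, X = number of trials to the r-th success
nb <- function(x, r, p) choose(x - 1, r - 1) * p^r * (1 - p)^(x - r)

r <- 4; p <- 0.2
p15      <- nb(15, r, p)          # exactly 15 couples must be asked
p_atmost <- sum(nb(r:15, r, p))   # at most 15 couples must be asked

print(round(c(p15, p_atmost), 3))
```

The second value also equals the probability of at least 4 successes in 15 binomial trials, since needing at most 15 trials for the 4th success is the same event.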


In the special case r = 1, the pmf is

\[ nb(x; 1, p) = (1-p)^{x-1} p, \qquad x = 1, 2, \ldots \qquad (3.21) \]


In Example 3.10, we derived the pmf for the number of trials necessary to obtain the first S, and the 
pmf there is identical to Expression (3.21). The variable X = number of trials required to achieve one 
success is referred to as a geometric random variable, and the pmf in (3.21) is called the geometric 
distribution. The name is appropriate because the probabilities form a geometric series: p, (1 − p)p,
(1 − p)²p, .... To see that the sum of the probabilities is 1, recall that the sum of a geometric series is
a + ar + ar² + ··· = a/(1 − r) if |r| < 1, so for p > 0,

\[ p + (1-p)p + (1-p)^2 p + \cdots = \frac{p}{1 - (1-p)} = 1 \]


In Example 3.19, the expected number of trials until the first S was shown to be 1/p. Intuitively, we
would then expect to need r · 1/p trials to achieve the rth S, and this is indeed E(X). There is also a
simple formula for V(X) and for the mgf.


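PROPOSITION If X is a negative binomial rv with parameters r and p, then

\[ E(X) = \frac{r}{p} \qquad V(X) = \frac{r(1-p)}{p^2} \qquad M_X(t) = \left[\frac{pe^t}{1 - (1-p)e^t}\right]^r \]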

See Exercise 123 for a derivation of these formulas. The corresponding formulas for the geometric 
distribution are obtained by substituting r = 1 above. 


Example 3.50 (Example 3.49 continued) With p = .2, the expected number of couples the doctor 
must speak to in order to find 4 that will agree to participate is r/p = 4/.2 = 20. This makes sense, 
since with p = .2 = 1/5 it will take 5 attempts, on average, to achieve one success. The corresponding 
variance is 4(1 − .2)/.2² = 80, for a standard deviation of about 8.9. ■
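The mean and variance in Example 3.50 can also be approximated numerically from the pmf. The R sketch below truncates the infinite set of possible values at an upper limit of our own choosing (2000), beyond which the probabilities are negligible for these parameter values.

```r
r <- 4; p <- 0.2
x  <- r:2000                                        # truncated support; the tail is negligible here
px <- choose(x - 1, r - 1) * p^r * (1 - p)^(x - r)  # nb(x; r, p)

total  <- sum(px)
mean_x <- sum(x * px)
var_x  <- sum((x - mean_x)^2 * px)

print(c(total = total, mean = mean_x, variance = var_x, sd = sqrt(var_x)))
```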


Since they are based on similar experiments, some caution must be taken to distinguish the 
binomial and negative binomial models, as seen in the next example. 


Example 3.51 In many communication systems, a receiver will send a short signal back to the 
transmitter to indicate whether a message has been received correctly or with errors. (These signals 
are often called an acknowledgement and a nonacknowledgement, respectively. Bit sum checks and 
other tools are used by the receiver to determine the absence or presence of errors.) Assume we are 
using such a system in a noisy channel, so that each message is sent error-free with probability .86, 
independent of all other messages. What is the probability that in 10 transmissions, exactly 8 will 
succeed? What is the probability the system will require exactly 10 attempts to successfully transmit 8 
messages? 

While these two questions may sound similar, they require two different models for solution. To 
answer the first question, let X = the number of successful transmissions among the 10. Then X ~ 
Bin(10, .86), and the answer is 


\[ P(X = 8) = b(8; 10, .86) = \binom{10}{8} (.86)^8 (.14)^2 = .2639 \]
However, the event {exactly 10 attempts required to successfully transmit 8 messages} is more
restrictive: not only must we observe 8 S’s and 2 F’s in 10 trials, but the last trial must be a success. 
Otherwise, it took fewer than 10 tries to send 8 messages successfully. Define a variable Y = the 
number of transmissions (trials) required to successfully transmit 8 messages. Then Y is negative 
binomial, with r = 8 and p = .86, and the answer to the second question is 


\[ P(Y = 10) = nb(10; 8, .86) = \binom{10-1}{8-1} (.86)^8 (.14)^2 = .2111 \]


Notice this is smaller than the answer to the first question, which makes sense because (as we noted) 
the second question imposes an additional constraint. In fact, you can think of the “−1” terms in the
negative binomial pmf as accounting for this loss of flexibility in the placement of S’s and F’s. 
Similarly, the expected number of successful transmissions in 10 attempts is E(X) = np = 10(.86) 
= 8.6, while the expected number of attempts required to successfully transmit 8 messages is 
E(Y) = r/p = 8/.86 = 9.3. In the first case, the number of trials (n = 10) is fixed, while in the second
case the desired number of successes (r = 8) is fixed. ■
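The contrast between the two models in Example 3.51 shows up entirely in the binomial coefficient. The R sketch below, with variable names of our own, computes both answers from their formulas and prints their ratio.

```r
p <- 0.86
p_binom <- choose(10, 8) * p^8 * (1 - p)^2  # 8 S's anywhere among the 10 trials
p_nb    <- choose(9, 7)  * p^8 * (1 - p)^2  # 7 S's among the first 9 trials, then an S on trial 10

print(round(c(binomial = p_binom, neg_binomial = p_nb), 4))
print(p_nb / p_binom)                       # ratio of the two coefficients, 36/45
```

The ratio 36/45 = .8 is the fraction of arrangements of 8 S's and 2 F's in which the final trial is an S.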
By expanding the binomial coefficient in front of $p^r(1-p)^{x-r}$ and doing some cancellation, it can
be seen that nb(x; r, p) is well defined even when r is not an integer. This generalized negative 
binomial distribution has been found to fit observed data quite well in a wide variety of applications. 


Alternative Definition of the Negative Binomial Distribution 

There is not universal agreement on the definition of a negative binomial random variable (or, by 
extension, a geometric rv). It is not uncommon in the literature, as well as in some textbooks 
(including previous editions of this book), to see the number of failures preceding the rth success 
called “negative binomial”; in our notation, this simply equals X − r. Possible values of this “number
of failures” variable are 0, 1, 2, .... Similarly, the geometric distribution is sometimes defined in terms 
of the number of failures preceding the first success in a sequence of independent and identical trials. 
If one uses these alternative definitions, then the pmf, mean, and mgf formulas must be adjusted 
accordingly (the variance, however, will stay the same). See Exercise 124. 

The developers of R are among those who have adopted this alternative definition; as a result, we 
must be careful with our inputs to the relevant software functions. The pmf syntax for the distributions
in this section is cataloged in Table 3.5; cdfs may be invoked by changing the initial letter d
to p in R. Notice the input argument x − r for the negative binomial functions: R requests the number
of failures, rather than the number of trials.


Table 3.5 R code for hypergeometric and negative binomial calculations

             Hypergeometric            Negative Binomial
Function:    pmf                       pmf
Notation:    h(x; n, M, N)             nb(x; r, p)
R:           dhyper(x, M, N-M, n)      dnbinom(x-r, r, p)


For example, suppose X has a hypergeometric distribution with n = 10, M = 5, N = 25 as in
Example 3.47. Using R, we may calculate P(X = 2) = dhyper(2,5,20,10) and P(X ≤ 2) =
phyper(2,5,20,10). If X is the negative binomial variable of Example 3.49 with parameters r = 4
and p = .2, then the chance of requiring 15 trials to achieve 4 successes (i.e., 11 total failures) can be
found in R with dnbinom(11,4,.2).
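The calls just described can be collected into a short script. The sketch below uses only the base R functions from Table 3.5 and their cdf counterparts; the final lines illustrate, under the truncation at 2000 failures that we chose for convenience, how a mean on R's "number of failures" scale is shifted back to the "number of trials" scale by adding r.

```r
# Hypergeometric, Example 3.47: n = 10, M = 5, N = 25
p_eq2 <- dhyper(2, 5, 20, 10)      # P(X = 2)
p_le2 <- phyper(2, 5, 20, 10)      # P(X <= 2)

# Negative binomial, Example 3.49: r = 4, p = .2, X = number of trials
r <- 4; p <- 0.2
p_15   <- dnbinom(15 - r, r, p)    # P(X = 15): R is given the x - r = 11 failures
p_le15 <- pnbinom(15 - r, r, p)    # P(X <= 15)

print(round(c(p_eq2, p_le2, p_15, p_le15), 3))

# Mean number of failures (R's scale), then shifted to the number of trials
y <- 0:2000
mean_failures <- sum(y * dnbinom(y, r, p))
mean_trials   <- mean_failures + r
print(c(mean_failures, mean_trials))
```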
Exercises: Section 3.7 (108-124)


108. 


109. 


110. 


An electronics store has received a ship- 
ment of 20 table radios that have connec- 
tions for an iPod or iPhone. Twelve of these 
have two slots (so they can accommodate 
both devices), and the other eight have a 
single slot. Suppose that six of the 20 radios 
are randomly selected to be stored under a 
shelf where radios are displayed, and the 
remaining ones are placed in a storeroom. 
Let X =the number among the radios 
stored under the display shelf that have two 
slots. 


a. What kind of a distribution does X have 
(name and values of all parameters)? 

b. Compute P(X = 2), P(X ≤ 2), and
P(X ≥ 2).

c. Calculate the mean value and standard 
deviation of X. 


Each of 12 refrigerators has been returned 
to a distributor because of an audible, high- 
pitched, oscillating noise when the refrig- 
erator is running. Suppose that 7 of these 
refrigerators have a defective compressor 
and the other 5 have less serious problems. 

If the refrigerators are examined in random
order, let X be the number among the first 6
examined that have a defective compressor.
Compute the following:

a. P(X = 5)

b. P(X ≤ 4)

c. The probability that X exceeds its mean 
value by more than 1 standard 
deviation. 

d. Consider a large shipment of 400 
refrigerators, of which 40 have defec- 
tive compressors. If X is the number 
among 15 randomly selected refriger- 
ators that have defective compressors, 
describe a less tedious way to calculate 
(at least approximately) P(X ≤ 5) than
to use the hypergeometric pmf. 


An instructor who taught two sections of 
statistics last term, the first with 20 students 


111. 


112. 
and the second with 30, decided to assign a
term project. After all projects had been 
turned in, the instructor randomly ordered 
them before grading. Consider the first 15 
graded projects. 


a. What is the probability that exactly 10 
of these are from the second section? 

b. What is the probability that at least 10 
of these are from the second section? 

c. What is the probability that at least 10 
of these are from the same section? 

d. What are the mean value and standard 
deviation of the number among these 
15 that are from the second section? 

e. What are the mean value and standard 
deviation of the number of projects not 
among these first 15 that are from the 
second section? 


A geologist has collected 10 specimens of 
basaltic rock and 10 specimens of granite. 
The geologist instructs a laboratory assis- 
tant to randomly select 15 of the specimens 
for analysis. 


a. What is the pmf of the number of 
granite specimens selected for
analysis? 

b. What is the probability that all speci- 
mens of one of the two types of rock 
are selected for analysis? 

c. What is the probability that the number 
of granite specimens selected for analysis
is within 1 standard deviation of its
mean value? 


Suppose that 20% of all individuals have an 
adverse reaction to a particular drug. 
A medical researcher will administer the 
drug to one individual after another until the 
first adverse reaction occurs. Define an 
appropriate random variable and use its dis- 
tribution to answer the following questions. 


a. What is the probability that when the 
experiment terminates, four individuals 
have not had adverse reactions? 

b. What is the probability that the drug is 
administered to exactly five individuals? 
c. What is the probability that at most
four individuals do not have an adverse 
reaction? 

d. How many individuals would you 
expect to not have an adverse reaction, 
and to how many individuals would 
you expect the drug to be given? 

e. What is the probability that the number 
of individuals given the drug is within 
1 standard deviation of what you 
expect? 

Twenty pairs of individuals playing in a
bridge tournament have been seeded 1, ...,
20. In the first part of the tournament, the
20 are randomly divided into 10 east-west
pairs and 10 north-south pairs.


a. What is the probability that x of the top 
10 pairs end up playing east-west?

b. What is the probability that all of the 
top five pairs end up playing the same 
direction? 

c. If there are 2n pairs, what is the pmf of 
X = the number among the top n pairs 
who end up playing east-west? What 
are E(X) and V(X)? 


A second-stage smog alert has been called 
in an area of Los Angeles County in which 
there are 50 industrial firms. An inspector 
will visit 10 randomly selected firms to 
check for violations of regulations. 


a. If 15 of the firms are actually violating 
at least one regulation, what is the pmf 
of the number of firms visited by the 
inspector that are in violation of at least 
one regulation? 

b. If there are 500 firms in the area, of 
which 150 are in violation, approxi- 
mate the pmf of part (a) by a simpler 
pmf. 

c. For X=the number among the 10 
visited that are in violation, compute 
E(X) and V(X) both for the exact pmf 
and the approximating pmf in part (b). 


Suppose that p = P(female birth) = .5.
A couple wishes to have exactly two female 


116. 
children in their family. They will have
children until this condition is fulfilled. 


a. What is the probability that the family 
has x male children? 

b. What is the probability that the family 
has four children? 

c. What is the probability that the family 
has at most four children? 

d. How many children would you expect 
this family to have? How many male 
children would you expect this family 
to have? 


A family decides to have children until it 
has three children of the same sex. 
Assuming P(B) = P(G) = .5, what is the 
pmf of X = the number of children in the 
family? 

Three brothers and their wives decide to 
have children until each family has two 
female children. Let X = the total number 
of male children born to the brothers. What 
is E(X), and how does it compare to the 
expected number of male children born to 
each brother? 


Individual A has a red die and B has a green 
die (both fair). If they each roll until they 
obtain five “doubles” (11, ..., 66), what is 
the pmf of X = the total number of times a 
die is rolled? What are E(X) and SD(X)? 


A shipment of 20 integrated circuits 
(ICs) arrives at an electronics manufactur- 
ing site. The site manager will randomly 
select 4 ICs and test them to see whether 
they are faulty. Unknown to the site man- 
ager, 5 of these 20 ICs are faulty. 


a. Suppose the shipment will be accepted 
if and only if none of the inspected 
ICs is faulty. What is the probability 
this shipment of 20 ICs will be 
accepted? 

b. Now suppose the shipment will be 
accepted if and only if at most one of 
the inspected ICs is faulty. What is the 
probability this shipment of 20 ICs will 
be accepted? 
c. How do your answers to (a) and
(b) change if the number of faulty ICs
in the shipment is 3 instead of 5? Recalculate
(a) and (b) to verify your
claim.


120. A carnival game consists of spinning a 


121. 


wheel with 10 slots, nine red and one blue. 
If you land on the blue slot, you win a 
prize. Suppose your significant other really 
wants that prize, so you will play until you 
win. 


a. What is the probability you’ll win on 
the first spin? 

b. What is the probability you'll require 
exactly 5 spins? At least 5 spins? At 
most five spins? 

c. What is the expected number of spins 
required for you to win the prize, and 
what is the corresponding standard 
deviation? 


A kinesiology professor, requiring volun- 
teers for her study, approaches students one 
by one at a campus hub. She will continue 
until she acquires 40 volunteers. Suppose 
that 25% of students are willing to volun- 
teer for the study, that the professor’s 
selections are random, and that the student 
population is large enough that individual 
“trials” (asking a student to participate) 
may be treated as independent. 


a. What is the expected number of stu- 
dents the kinesiology professor will 
need to ask in order to get 40 volun- 
teers? What is the standard deviation? 

b. Determine the probability that the 
number of students the kinesiology 
professor will need to ask is within one 
standard deviation of the mean. 


122. 
Refer back to the communication system of
Example 3.51. Suppose a voice packet can
be transmitted a maximum of 10 times; i.e.,
if the 10th attempt fails, no 11th attempt is
made to re-transmit the voice packet. Let
X = the number of times a message is
transmitted. Assuming each transmission 
succeeds with probability p, determine the 
pmf of X. Then obtain an expression for the 
expected number of times a packet is 
transmitted. 


Newton’s generalization of the binomial
theorem can be used to show that, for any
positive integer r,

\[ (1-u)^{-r} = \sum_{k=0}^{\infty} \binom{r+k-1}{r-1} u^k \]

Use this to derive the negative binomial
mgf presented in this section. Then obtain
the mean and variance of a negative binomial
rv using this mgf.


If X is a negative binomial rv, then the rv 
Y = X — r is the total number of failures 
preceding the rth success. (As mentioned in
this section, Y is also sometimes called a 
negative binomial rv.) 


a. Use an argument similar to the one 
presented in this section to derive the 
pmf of Y. 

b. Obtain the mgf of Y. [Hint: Use the 
mgf of X and the fact that Y = X − r.]

c. Determine the mean and variance of 
Y. Are these intuitively consistent with 
the expressions for E(X) and V(X)? 
Explain. 


3.8 Simulation of Discrete Random Variables 


Probability calculations for complex systems often depend on the behavior of various random 
variables. When such calculations are difficult or impossible, simulation is the fallback strategy. In 
this section, we give a general method for simulating an arbitrary discrete random variable and 
consider implementations in existing software for simulating common discrete distributions. 
Example 3.52 Let X = the amount of memory (GB) in a purchased flash drive, and suppose X has
the following pmf:


x       16     32     64     128    256
p(x)    .05    .10    .35    .40    .10


We wish to simulate X. Recall from Section 2.6 that we begin with a “standard uniform” random 
number generator, i.e., a software function that generates evenly distributed numbers in the interval 
[0, 1). Our goal is to convert these decimals into the values of X with the probabilities specified by its 
pmf: 5% 16’s, 10% 32’s, 35% 64’s, and so on. To that end, we partition the interval [0, 1) according 
to these percentages: [0, .05) has probability .05; [.05, .15) has probability .1, since the length of the 
interval is .1; [.15, .50) has probability .50 − .15 = .35; etc. Proceed as follows: given a value u from
the RNG, 


• If 0 ≤ u < .05, assign the value 16 to the variable x.
• If .05 ≤ u < .15, assign x = 32.
• If .15 ≤ u < .50, assign x = 64.
• If .50 ≤ u < .90, assign x = 128.
• If .90 ≤ u < 1, assign x = 256.


Repeating this algorithm n times gives n simulated values of X. An R program that implements this 
algorithm appears in Figure 3.10; it returns a vector, x, containing n = 10,000 simulated values of the 
specified distribution. 


Figure 3.10 R simulation code 
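
The listing for Figure 3.10 is not reproduced here. A minimal sketch of the algorithm just described might look as follows; the variable names and the use of runif as the standard uniform RNG are assumptions of this sketch, not necessarily the choices made in the original figure.

```r
# Assumed reconstruction of the inverse cdf algorithm for Example 3.52;
# not necessarily the code shown in Figure 3.10.
n <- 10000
x <- numeric(n)            # storage for the simulated values
for (i in 1:n) {
  u <- runif(1)            # one standard uniform random number
  if (u < .05) {
    x[i] <- 16
  } else if (u < .15) {
    x[i] <- 32
  } else if (u < .50) {
    x[i] <- 64
  } else if (u < .90) {
    x[i] <- 128
  } else {
    x[i] <- 256
  }
}
table(x) / n               # relative frequencies, to compare with the pmf
```

Because each else branch is reached only when the earlier conditions fail, testing the upper cut point alone is enough to place u in the correct subinterval.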


Figure 3.11 (p. 175) shows a graph of the results of executing the above code, in the form of a 
histogram: the height of each rectangle corresponds to the relative frequency of each x value in the 
simulation (i.e., the number of times that value occurred, divided by 10,000). The exact pmf of X is 
superimposed for comparison; as expected, simulation results are similar, but not identical, to the 
theoretical distribution. 

Later in this section, we will present a faster, built-in way to simulate discrete distributions in R. 
The method introduced above will, however, prove useful in adapting to the case of continuous 
random variables in Chapter 4.

Figure 3.11 Simulation and exact distribution for Example 3.52


In the preceding example, the selected subintervals of [0, 1) were not our only choices—any five 
intervals with lengths .05, .10, .35, .40, and .10 would produce the desired result. However, those 
particular five subintervals have one desirable feature: the “cut points” for the intervals (i.e., 0, .05, 
.15, .50, .90, and 1) are precisely the possible heights of the graph of the cdf, F(x). This permits a 
geometric interpretation of the algorithm, which can be seen in Figure 3.12. The value u provided by 
the RNG corresponds to a position on the vertical axis between 0 and 1; we then “invert” the cdf by 
matching this u-value back to one of the gaps in the graph of F(x), denoted by dashed lines in 
Figure 3.12. If the gap occurs at horizontal position x, then x is our simulated value of the rv X for that 
run of the simulation. This is often referred to as the inverse cdf method for simulating discrete 
random variables. The general method is spelled out in the accompanying box. 


Figure 3.12 The inverse cdf method for Example 3.52 (graph of the cdf F(x) against x = 16, 32, 64, 128, 256)


Inverse cdf Method for Simulating Discrete Random Variables

Let X be a discrete random variable taking on values \(x_1 < x_2 < \cdots\) with
corresponding probabilities \(p_1, p_2, \ldots\). Define \(F_0 = 0\); \(F_1 = F(x_1) = p_1\);
\(F_2 = F(x_2) = p_1 + p_2\); and, in general, \(F_k = F(x_k) = p_1 + \cdots + p_k = F_{k-1} + p_k\).
To simulate a value of X, proceed as follows:


1. Use an RNG to produce a value, u, from [0, 1).
2. If \(F_{k-1} \le u < F_k\), then assign \(x = x_k\).
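
The boxed method can be sketched as a general-purpose R function. The function name rdiscrete and its argument names are hypothetical, introduced only for illustration, and the sketch assumes a finite list of possible values supplied in increasing order.

```r
# Hypothetical helper (the name rdiscrete is ours): simulate n values of a
# discrete rv with possible values `vals` and matching probabilities `probs`.
rdiscrete <- function(n, vals, probs) {
  stopifnot(length(vals) == length(probs), abs(sum(probs) - 1) < 1e-8)
  Fk <- cumsum(probs)           # cut points F_1, F_2, ... of the cdf
  u <- runif(n)                 # n standard uniform values
  # findInterval(u, Fk) counts the cut points that are <= u, so adding 1
  # gives the index k satisfying F_{k-1} <= u < F_k.
  k <- findInterval(u, Fk) + 1
  k <- pmin(k, length(vals))    # guard against rounding error in cumsum
  vals[k]
}

# Usage: the flash drive distribution of Example 3.52
x <- rdiscrete(10000, c(16, 32, 64, 128, 256), c(.05, .10, .35, .40, .10))
table(x) / length(x)
```

Working with the whole vector u at once avoids an explicit loop, which is one reasonable design choice among several.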


Example 3.53 (Example 3.52 continued): Suppose the prices for the flash drives, in increasing order 
of memory size, are $10, $15, $20, $25, and $30. If the store sells 80 flash drives in a week, what’s 
the probability they will make a gross profit of at least $1800? 

Let Y = the amount spent on a flash drive, which has the following pmf:


y      10    15    20    25    30
p(y)   .05   .10   .35   .40   .10


The gross profit for 80 purchases is the sum of 80 values from this distribution. Let A = {gross
profit ≥ $1800}. We can use simulation to estimate P(A), as follows:


0. Set a counter for the number of times A occurs to zero. 

Repeat n times: 

1. Simulate 80 values \(y_1, \ldots, y_{80}\) from the above pmf (using, e.g., an inverse cdf program similar
to the one displayed in Figure 3.10).

2. Compute the week’s gross profit, \(g = y_1 + \cdots + y_{80}\).

3. If g ≥ 1800, add 1 to the count of occurrences for A.

Once the n runs are complete, then \(\hat{P}(A)\) = (count of the occurrences of A)/n.
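
The steps above can be sketched in R as follows. This is an illustrative sketch rather than the authors' program: the variable names are ours, and for brevity it draws the 80 purchase amounts with R's built-in sample function (discussed later in this section) instead of a hand-written inverse cdf routine.

```r
# Sketch of the simulation in Example 3.53 (variable names are ours).
n <- 10000
prices <- c(10, 15, 20, 25, 30)
probs  <- c(.05, .10, .35, .40, .10)
g <- numeric(n)                          # simulated weekly gross profits
for (i in 1:n) {
  y <- sample(prices, 80, replace = TRUE, prob = probs)  # 80 purchases
  g[i] <- sum(y)                         # gross profit for this week
}
p_hat <- mean(g >= 1800)                 # proportion of runs in which A occurred
se    <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p_hat
c(p_hat, se)
```

Taking the mean of the logical vector g >= 1800 is equivalent to keeping a counter and dividing by n at the end.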

Figure 3.13 shows the resulting values of g for n = 10,000 simulations in R. In effect, our program
is simulating a random variable \(G = Y_1 + \cdots + Y_{80}\) whose pmf is not known (in light of all the
possible G values, it would not be worthwhile to attempt to determine its pmf analytically). The
highlighted bars in Figure 3.13 correspond to g values of at least $1800; in our simulation, such
values occurred 1940 times. Thus, \(\hat{P}(A) = 1940/10{,}000 = .194\), with an estimated standard error of


\[ \sqrt{.194(1 - .194)/10{,}000} = .004. \]

Figure 3.13 Simulated distribution of weekly gross profit for Example 3.53


Simulations Implemented in R

Earlier in this section, we presented the inverse cdf method as a general way to simulate discrete 
distributions applicable in any software. In fact, one can simulate generic discrete rvs in R by clever 
use of the built-in sample function. We saw this function in the context of probability simulation in 
Chapter 2. The sample function is designed to generate a random sample from any selected set of 
values (even including text values, if desired); the “clever” part is that it can accommodate a set of 
weights. The following short example illustrates their use. 

To simulate, say, 35 values from the pmf in Example 3.53, one can use the following code in 
R: sample(c(10,15,20,25,30),35,TRUE,c(.05,.10,.35,.40,.10)). The function
takes four arguments: the list of y values, the desired number of simulated values (the “sample
size”), whether to sample with replacement (here, TRUE), and the list of probabilities in the same 
order as the y values. 

Thanks to the ubiquity of the binomial, Poisson, and other distributions in probability modeling, 
many software packages have built-in tools for simulating values from these distributions. Table 3.6 
summarizes the relevant functions in R; the input argument size refers to the desired number of 
simulated values of the distribution. 


Table 3.6 Functions to simulate major discrete distributions in R 


Distribution          R code

Binomial              rbinom(size, n, p)
Poisson               rpois(size, μ)
Hypergeometric        rhyper(size, M, N - M, n)
Negative binomial     rnbinom(size, r, p)


A word of warning (really, a reminder) about the way software treats the negative binomial 
distribution: R defines a negative binomial rv as the number of failures preceding the rth success,
which differs from our definition. Assuming you want to simulate the number of trials required to 
achieve r successes, execute the code in the last line of Table 3.6 and then add r to each value. 
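
As a brief illustration of this adjustment, the sketch below simulates the number of trials needed to obtain r successes. The values r = 3 and p = .4 are arbitrary choices made only for the example.

```r
# Sketch: simulate the number of trials required to obtain r successes.
# The values of r and p below are illustrative assumptions.
r <- 3
p <- .4
failures <- rnbinom(10000, r, p)   # R counts failures before the rth success
trials   <- failures + r           # shift so that we count trials instead
mean(trials)                       # should be close to r/p
min(trials)                        # can never be smaller than r
```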


Example 3.54 The number of customers shipping express mail packages at a certain store during 
any particular hour of the day is a Poisson rv with mean 5. Each such customer has 1, 2, 3, or 4 
packages with probabilities .4, .3, .2, and .1, respectively. Let’s carry out a simulation to estimate the 
probability that at most 10 packages are shipped during any particular hour. 

Define an event A = {at most 10 packages shipped in an hour}. Our simulation to estimate 
P(A) proceeds as follows. 


0. Set a counter for the number of times A occurs to zero. 

Repeat n times: 

1. Simulate the number of customers in an hour, X, which is Poisson with μ = 5.

2. For each of the X customers, simulate the number of packages shipped according to the pmf 
above. 

3. If the total number of packages shipped is at most 10, add 1 to the counter for A.

R code to implement this simulation appears in Figure 3.14.

A <- 0                               # counter for occurrences of A
for (i in 1:10000) {
  x <- rpois(1, 5)                   # number of customers in the hour
  # number of packages shipped by each of the x customers
  packages <- sample(c(1,2,3,4), x,
                     TRUE, c(.4,.3,.2,.1))
  if (sum(packages) <= 10) {
    A <- A + 1                       # event A occurred on this run
  }
}


Figure 3.14 R simulation code for Example 3.54 


In R, 10,000 simulations resulted in 10 or fewer packages 5752 times, for an estimated probability 
of \(\hat{P}(A) = .5752\), with an estimated standard error of \(\sqrt{.5752(1 - .5752)/10{,}000} = .0049\).


Simulation Mean, Standard Deviation, and Precision 

In Section 2.6 and in the preceding examples, we used simulation to estimate the probability of an 
event. But consider the “gross profit” variable in Example 3.53: since we have 10,000 simulated 
values of this variable, we should be able to estimate its mean and its standard deviation. In general, 
suppose we have simulated n values \(x_1, \ldots, x_n\) of a random variable X. Then, not surprisingly, we
estimate \(\mu_X\) and \(\sigma_X\) with the sample mean \(\bar{x}\) and sample standard deviation s, respectively, of the
n simulated values.

In Section 2.6, we introduced the standard error of an estimated probability, which quantifies the 
precision of a simulation result \(\hat{P}(A)\) as an estimate of a “true” probability P(A). By analogy, it is
possible to quantify the amount by which a sample mean, \(\bar{x}\), will generally differ from the corresponding
expected value μ. For n simulated values of a random variable, with sample standard
deviation s, the (estimated) standard error of the mean is


\[ \text{Estimated standard error of the mean} = \frac{s}{\sqrt{n}} \tag{3.22} \]


Expression (3.22) will be derived in Chapter 6. As with an estimated probability, (3.22) indicates that 
the precision of \(\bar{x}\) increases (i.e., its standard error decreases) as n increases, but not very quickly. To
increase the precision of \(\bar{x}\) as an estimate of μ by a factor of 10 (one decimal place) requires increasing
the number of simulation runs, n, by a factor of 100. Unfortunately, there is no general formula for the
standard error of s as an estimate of σ.
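
To see where the factor of 100 comes from, the following short calculation applies (3.22) with 100n runs in place of n, under the simplifying assumption that the sample standard deviation s is about the same in both simulations:

```latex
\[
\frac{s}{\sqrt{100n}} = \frac{s}{10\sqrt{n}} = \frac{1}{10}\cdot\frac{s}{\sqrt{n}}
\]
```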


Example 3.55 (Example 3.53 continued) The 10,000 simulated values of the random variable G,
which we denote by \(g_1, \ldots, g_{10{,}000}\), are displayed in the histogram in Figure 3.13. From these
simulated values, we can estimate both the expected value and standard deviation of G:


\[ \hat{\mu}_G = \bar{g} = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} g_i = 1759.62 \]

\[ \hat{\sigma}_G = s = \sqrt{\frac{1}{10{,}000 - 1} \sum_{i=1}^{10{,}000} (g_i - \bar{g})^2} = \sqrt{\frac{1}{9999} \sum_{i=1}^{10{,}000} (g_i - 1759.62)^2} = 43.50 \]


We estimate that the average weekly gross profit from flash drive sales is $1759.62, with a standard 
deviation of $43.50.

Applying (3.22), the (estimated) standard error of \(\bar{g}\) is \(s/\sqrt{n} = 43.50/\sqrt{10{,}000} = 0.435\). If 10,000
runs are used to simulate G, it’s estimated that the resulting sample mean will differ from E(G) by 
roughly 0.435. (In contrast, the sample standard deviation, s, estimates that the gross profit for a single 
week—i.e., a single observation g—typically differs from E(G) by about $43.50.) 

In Chapter 5, we will see how the expected value and variance of random variables like G, that are 
sums of a fixed number of other rvs, can be obtained analytically.
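
The estimates in this example could be computed along the following lines. This is a sketch under our own naming choices; because the simulated values are random, a fresh run will give numbers close to, but not identical with, those reported above.

```r
# Sketch for Example 3.55: simulate G, then estimate its mean, its standard
# deviation, and the standard error of the mean via (3.22).
n <- 10000
prices <- c(10, 15, 20, 25, 30)
probs  <- c(.05, .10, .35, .40, .10)
g <- replicate(n, sum(sample(prices, 80, replace = TRUE, prob = probs)))
g_bar <- mean(g)          # estimate of E(G)
s     <- sd(g)            # estimate of SD(G); sd() uses the divisor n - 1
se    <- s / sqrt(n)      # estimated standard error of the mean
c(g_bar, s, se)
```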


Example 3.56 The “help desk” at a university’s computer center receives both hardware and 
software queries. Let X and Y be the number of hardware and software queries, respectively, in a 
given day. Each can be modeled by a Poisson distribution with mean 20. Because computer center 
employees need to be allocated efficiently, of interest is the difference between the sizes of the two 
queues: D = |X − Y|. Let’s use simulation to estimate (1) the probability the queue sizes differ by
more than 5; (2) the expected difference; (3) the standard deviation of the difference. 

Figure 3.15 shows R code to simulate this process. The code exploits the built-in Poisson simulator,
as well as the fact that 10,000 simulated values may be called simultaneously.


X <- rpois(10000, 20)   # hardware queries
Y <- rpois(10000, 20)   # software queries
D <- abs(X - Y)         # difference in queue sizes
sum((D>5))              # number of runs with D > 5
mean(D)
sd(D)


Figure 3.15 R simulation code for Example 3.56 


The line sum((D>5)) performs two operations: first, (D>5) determines if each simulated 
d value exceeds 5, returning a vector of logical bits; second, sum() tallies the “success” bits (1’s or 
TRUEs) and gives a count of the number of times the event {D > 5} occurred in the 10,000 
simulations. The results from one run were 


\[ \hat{P}(D > 5) = \frac{3843}{10{,}000} = .3843 \qquad \hat{\mu}_D = \bar{d} = 5.0380 \qquad \hat{\sigma}_D = s = 3.8436 \]
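
As a supplementary check on precision, the standard error formulas used earlier in this section can be applied to these reported results. The short sketch below simply plugs in the numbers above; the variable names are ours.

```r
# Plug the reported results of Example 3.56 into the standard error formulas.
n     <- 10000
p_hat <- 3843 / n                          # estimated P(D > 5)
s     <- 3.8436                            # sample standard deviation of D
se_p    <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
se_mean <- s / sqrt(n)                     # standard error of the mean, (3.22)
c(se_p, se_mean)
```

Both standard errors are small relative to the estimates themselves (roughly .005 for the probability and .04 for the mean).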


A histogram of the simulated values of D appears in Figure 3.16.


Section 3.8 Exercises (125-137)

125. Consider the pmf given in Exercise 29 for
the random variable Y = the number of
moving violations for which a randomly
selected insured individual was cited during
the last 3 years. Write a program to simulate
this random variable, then use your
simulation to estimate E(Y) and SD(Y).
How do these compare to the exact values
of E(Y) and SD(Y)?

126. Consider the pmf given in Exercise 31 for
the random variable X = capacity of a
purchased freezer. Write a program to
simulate this random variable, then use
your simulation to estimate both E(X) and
SD(X). How do these compare to the exact
values of E(X) and SD(X)?

127. Suppose person after person is tested for
the presence of a certain characteristic. The
probability that any individual tests positive
is .75. Let X = the number of people who
must be tested to obtain five consecutive
positive test results. Use simulation to
estimate P(X < 25).

128. The matching problem. Suppose that
N items labeled 1, 2, ..., N are shuffled so
that they are in random order. Of interest is
how many of these will be in their “correct”
positions (e.g., item #5 situated at the 5th
position in the sequence, etc.) after shuffling.

a. Write a program that simulates a permutation
of the numbers 1 to N and then
records the value of the variable
X = number of items in the correct
position.

b. Set N = 5 in your program, and use at
least 10,000 simulations to estimate
E(X), the expected number of items in
the correct position.

c. Set N = 52 in your program (as if you
were shuffling a deck of cards), and use
at least 10,000 simulations to estimate
E(X). What do you discover? Is this
surprising?

129. Exercise 101 of Chapter 2 referred to a
multiple-choice exam in which 10 of the
questions have two options, 13 have three
options, 13 have four options, and the other
4 have five options. Let X = the number of
questions a student gets right, assuming
s/he is completely guessing.

a. Write a program to simulate X, and use
your program to estimate the mean and
standard deviation of X.

b. Estimate the probability a student will
score at least one standard deviation
above the mean.

130. Example 3.53 of this section considered the
gross profit G resulting from selling flash
drives to 80 customers per week. Of course,
it isn’t realistic for the number of customers
to remain fixed from week to week. So,
instead, imagine the number of customers
buying flash drives in a week follows a
Poisson distribution with mean 80, and that
the amount paid by each customer follows
the distribution for Y provided in that
example. Write a program to simulate the
random variable G, and use your simulation
to estimate

a. The probability that weekly gross sales
are at least $1800.

b. The mean of G.

c. The standard deviation of G.

131. Exercise 19 investigated Benford’s law, a
discrete distribution with pmf given by
\(p(x) = \log_{10}((x + 1)/x)\) for x = 1, 2, ..., 9.
Use the inverse cdf method to write a program
that simulates the Benford’s law distribution.
Then use your program to
estimate the expected value and variance of
this distribution.

132. Recall that a geometric rv has pmf
\(p(x) = p(1-p)^{x-1}\) for x = 1, 2, 3, .... In Example
3.12, it was shown that the cdf of this distribution
is \(F(x) = 1 - (1-p)^x\) for positive
integers x.

a. Write a program that implements the
inverse cdf method to simulate a geometric
distribution. Your program
should have as inputs the numerical
value of p and the desired sample size.

b. Use your program to simulate 10,000
values from a geometric rv X with
p = .85. From these values, estimate
each of the following: P(X < 2),
E(X), SD(X). How do these compare to
the corresponding exact values?

133. Tickets for a particular flight are $250
apiece. The plane seats 120 passengers, but
the airline will knowingly overbook (i.e.,
sell more than 120 tickets), because not
every paid passenger shows up. Let t denote
the number of tickets the airline sells
for this flight, and assume the number of
passengers that actually show up for the
flight, X, follows a Bin(t, .85) distribution.
Let B = the number of paid passengers who
show up at the airport but are denied a seat
on the plane, so B = X − 120 if X > 120
and B = 0 otherwise. If the airline must
compensate these passengers with $500
apiece, then the profit the airline makes on
this flight is 250t − 500B. (Notice that t is
fixed, but B is random.)


a. Write a program to simulate this scenario.
Specifically, your program
should take in t as an input and return
many values of the profit variable
250t − 500B, where B is described
above.

b. The airline wishes to determine the
optimal value of t, i.e., the number of
tickets to sell that will maximize their
expected profit. Run your program for
t = 140, 141, ..., 150, and record the
average profit from many runs under
each of these settings. What value of
t appears to return the largest value?
[Note: If a clear winner does not
emerge, you might need to increase the
number of runs for each t value!]


134. Imagine the following simple game: flip a
fair coin repeatedly, winning $1 for every 
head and losing $1 for every tail. Your net 
winnings will potentially oscillate between 
positive and negative numbers as play 
continues. How many times do you think 
net winnings will change signs in, say, 
1000 coin flips? 5000 flips?

a. Let X = the number of sign changes in
1000 coin flips. Write a program to 
simulate X, and use your program to 
estimate the probability of at least 10 
sign changes. 

b. Use your program to estimate both 
E(X) and SD(X). Does your estimate 
for E(X) match your intuition for the 
number of sign changes? 

c. Repeat parts (a)-(b) with 5000 flips. 


135. Exercise 40 describes the game Plinko from
The Price is Right. Each contestant drops
between one and 5 chips down the Plinko 
board, depending on how well s/he prices 
several small items. Suppose the random 
variable C = number of chips earned by a 
contestant has the following distribution: 


c 1 2 3 4 5 
p(c) 03S 3584


The winnings from each chip follow the 
distribution presented in Exercise 40. Write 
a program to simulate Plinko; you will need 
to consider both the number of chips a 
contestant earns and how much money is 
won on each of those chips. Use your 
simulation to estimate the answers to the 
following questions: 


a. What is the probability a contestant 
wins more than $11,000? 


b. What is a contestant’s expected
winnings? 

c. What is the corresponding standard 
deviation? 


d. In fact, a player gets one Plinko chip 
for free and can earn the other four by 
guessing the prices of small items 
(waffle irons, alarm clocks, etc.). 
Assume the player has a 50-50 chance 
of getting each price correct, so 
we may write C = 1 + R, where R ~ 
Bin(4, .5). Use this revised model for 
C to estimate the answers to (a)—(c). 


136. Recall the Coupon Collector’s Problem
described in Exercise 106 of Chapter 2. Let 
X = the number of cereal boxes purchased 
in order to obtain all 10 coupons. 


a. Use a simulation program to estimate 
E(X) and SD(X). Also compute the 
estimated standard error of your answer. 

b. Repeat (a) with 20 coupons required 
instead of 10. Does it appear to take 
roughly twice as long to collect 20 
coupons as 10? More than twice as 
long? Less? 


137. A small high school holds its graduation
ceremony in the gym. Because of seating 
constraints, students are limited to a maxi- 
mum of four tickets to graduation for family 
and friends. Suppose 30% of students want 
four tickets, 25% want three, 25% want 
two, 15% want one, and 5% want none. 


a. Write a simulation for 150 graduates 
requesting tickets, where students’ 
requests follow the distribution descri- 
bed above. In particular, keep track of 
the variable T = the total number of 
tickets requested by these 150 students. 

b. The gym can seat a maximum of 410 
guests. Based on your simulation, 
estimate the probability that all stu- 
dents’ requests can be accommodated. 


Supplementary Exercises: (138-169) 


138. Consider a deck consisting of seven cards,
marked 1, 2, ..., 7. Three of these cards are 
selected at random. Define a rv W by 
W = the sum of the resulting numbers, and 
compute the pmf of W. Then compute μ
and σ². [Hint: Consider outcomes as unordered,
so that (1, 3, 7) and (3, 1, 7) are not
different outcomes. Then there are 35 out- 
comes, and they can be listed.] (This type 
of rv actually arises in connection with 
Wilcoxon’s rank-sum test, in which there is
an x sample and a y sample and W is the
sum of the ranks of the x’s in the combined
sample.)

139. After shuffling a deck of 52 cards, a dealer
deals out 5. Let X = the number of suits 
represented in the five-card hand. 


a. Show that the pmf of X is 


x      1      2      3      4
p(x)   .002   .146   .588   .264

[Hint: p(1) = 4P(all spades), p(2) =
6P(only spades and hearts with at least
one of each), and p(4) = 4P(2 spades ∩
one of each other suit).]

b. Compute μ, σ², and σ.


140. Let X be a rv with mean μ. Show that
E(X²) ≥ μ², and that E(X²) > μ² unless
X is a constant. [Hint: Consider variance.]

141. Of all customers purchasing automatic
garage-door openers, 75% purchase a 
chain-driven model. Let X = the number 
among the next 15 purchasers who select 
the chain-driven model. 


a. What is the pmf of X?

b. Compute P(X > 10).

c. Compute P(6 < X < 10).

d. Compute μ and σ².

e. If the store currently has in stock 10
chain-driven models and 8 shaft-driven
models, what is the probability that the
requests of these 15 customers can all be
met from existing stock?


142. A friend recently planned a camping
trip. He had two flashlights, one that 
required a single 6-V battery and another 
that used two size-D batteries. He had pre- 
viously packed two 6-V and four size-D 
batteries in his camper. Suppose the prob- 
ability that any particular battery works is 
p and that batteries work or fail indepen- 
dently of one another. Our friend wants to 
take just one flashlight. For what values of 
p should he take the 6-V flashlight? 


143. Binary data is transmitted over a noisy
communication channel. The probability
that a received binary digit is in error due to
channel noise is 0.05. Assume that such 
errors occur independently within the bit 
stream. 


a. What is the probability that the 3rd error 
occurs on the 50th transmitted bit? 

b. On average, how many bits will be 
transmitted correctly before the first 
error? 

c. Consider a 32-bit “word.” What is the 
probability of exactly 2 errors in this 
word? 

d. Consider the next 10,000 bits. What 
approximating model could we use for 
X = the number of errors in these 10,000 
bits? Give the name of the model and the 
value(s) of the parameter(s). 


144. A manufacturer of flashlight batteries
wishes to control the quality of its product 
by rejecting any lot in which the propor- 
tion of batteries having unacceptable volt- 
age appears to be too high. To this end, 
out of each large lot (10,000 batteries), 25 
will be selected and tested. If at least 5 of 
these generate an unacceptable voltage, the 
entire lot will be rejected. What is the 
probability that a lot will be rejected if 


a. Five percent of the batteries in the lot 
have unacceptable voltages? 

b. Ten percent of the batteries in the lot 
have unacceptable voltages? 

c. Twenty percent of the batteries in the lot 
have unacceptable voltages? 

d. What would happen to the probabilities 
in parts (a)-(c) if the critical rejection 
number were increased from 5 to 6? 

145. Of the people passing through an airport
metal detector, .5% activate it; let X = the
number among a randomly selected group
of 500 who activate the detector.


a. What is the (approximate) pmf of X?
b. Compute P(X = 5).
c. Compute P(5 < X).

146. An educational consulting firm is trying to
decide whether high school students who 
have never before used a hand-held calcu- 
lator can solve a certain type of problem 
more easily with a calculator that uses 
reverse Polish logic or one that does not use 
this logic. A sample of 25 students is 
selected and allowed to practice on both 
calculators. Then each student is asked to 
work one problem on the reverse Polish 
calculator and a similar problem on the
other. Let p = P(S), where S indicates that a
student worked the problem more quickly
using reverse Polish logic than without, and
let X = number of S’s.

a. If p = .5, what is P(7 < X < 18)?

b. If p = .8, what is P(7 < X < 18)?

c. If the claim that p = .5 is to be rejected 
when either X < 7 or X > 18, what is 
the probability of rejecting the claim 
when it is actually correct? 

d. If the decision to reject the claim p = .5 
is made as in part (c), what is the 
probability that the claim is not rejected 
when p = .6? When p = .8? 

e. What decision rule would you choose 
for rejecting the claim p=.5 if you 
wanted the probability in part (c) to be 
at most .01? 


147. Consider a disease whose presence can be
identified by carrying out a blood test. Let 
p denote the probability that a randomly 
selected individual has the disease. Suppose 
n individuals are independently selected for 
testing. One way to proceed is to carry out a 
separate test on each of the n blood sam- 
ples. A potentially more economical 
approach, group testing, was introduced 
during World War II to identify syphilitic 
men among army inductees. First, take a 
part of each blood sample, combine these 
specimens, and carry out a single test. If no 
one has the disease, the result will be neg- 
ative, and only the one test is required. If at 
least one individual is diseased, the test on 
the combined sample will yield a positive 
result, in which case the n individual tests
are then carried out. If p = .1 and n = 3,
what is the expected number of tests using
this procedure? What is the expected
number when n = 5? [The article “Random
Multiple-Access Communication and
Group Testing” (IEEE Trans. Commun.
1984: 769-774) applied these ideas to a
communication system in which the
dichotomy was active/idle user rather than
diseased/nondiseased.]


148. Let \(p_1\) denote the probability that any particular
code symbol is erroneously transmitted
through a communication system.
Assume that on different symbols, errors
occur independently of one another. Suppose
also that with probability \(p_2\) an erroneous
symbol is corrected upon receipt. Let
X denote the number of correct symbols in 
a message block consisting of n symbols 
(after the correction process has ended). 
What is the probability distribution of X? 


149. The purchaser of a power-generating unit
requires c consecutive successful start-ups 
before the unit will be accepted. Assume that 
the outcomes of individual start-ups are 
independent of one another. Let p denote the 
probability that any particular start-up is 
successful. The random variable of interest is 
X = the number of start-ups that must be 
made prior to acceptance. Give the pmf 
of X for the case c = 2. If p = .9, what is 
P(X < 8)? [Hint: For x > 5, express 
p(x) “recursively” in terms of the pmf 
evaluated at the smaller values x − 3, x − 4,
..., 2.] (This problem was suggested by
the article “Evaluation of a Start-Up 
Demonstration Test,” J. Qual. Tech. 1983: 
103-106.) 


150. A plan for an executive travelers’ club has
been developed by an airline on the premise 
that 10% of its current customers would 
qualify for membership. 


a. Assuming the validity of this premise, 
among 25 randomly selected current 
customers, what is the probability that 
between 2 and 6 (inclusive) qualify for 
membership?

b. Again assuming the validity of the
premise, what are the expected number 
of customers who qualify and the stan- 
dard deviation of the number who 
qualify in a random sample of 100 
current customers? 

c. Let X denote the number in a random 
sample of 25 current customers who 
qualify for membership. Consider 
rejecting the company’s premise in favor 
of the claim that p > .10 if x > 7. What
is the probability that the company’s 
premise is rejected when it is actually 
valid? 

d. Refer to the decision rule introduced in 
part (c). What is the probability that the 
company’s premise is not rejected even 
though p = .20 (i.e., 20% qualify)?


151. Forty percent of seeds from maize (modern-day
corn) ears carry single spikelets, and
the other 60% carry paired spikelets. A seed 
with single spikelets will produce an ear 
with single spikelets 29% of the time, 
whereas a seed with paired spikelets will 
produce an ear with single spikelets 26% of 
the time. Consider randomly selecting ten 
seeds. 


a. What is the probability that exactly five 
of these seeds carry a single spikelet and 
produce an ear with a single spikelet? 

b. What is the probability that exactly five 
of the ears produced by these seeds have 
single spikelets? What is the probability 
that at most five ears have single 
spikelets? 

152. A trial has just resulted in a hung jury
because eight members of the jury were in 
favor of a guilty verdict and the other four 
were for acquittal. If the jurors leave the 
jury room in random order and each of the 
first four leaving the room is accosted by a 
reporter in quest of an interview, what is the 
pmf of X = the number of jurors favoring 
acquittal among those interviewed? How 
many of those favoring acquittal do you 
expect to be interviewed?

153. A reservation service employs five information
operators who receive requests for
information independently of one another,
each according to a Poisson process with
rate λ = 2/min.


a. What is the probability that during a
given 1-min period, the first operator
receives no requests?

b. What is the probability that during a
given 1-min period, exactly four of the
five operators receive no requests?

c. Write an expression for the probability 
that during a given 1-min period, all of 
the operators receive exactly the same 
number of requests. 


154. Grasshoppers are distributed at random in a
large field according to a Poisson distribution
with parameter λ = 2 per square yard.
How large should the radius R of a circular 
sampling region be taken so that the prob- 
ability of finding at least one in the region 
equals .99? 


155. A newsstand has ordered five copies of a
certain issue of a photography magazine. 
Let X =the number of individuals who 
come in to purchase this magazine. If X has 
a Poisson distribution with parameter 
μ = 4, what is the expected number of
copies that are sold? 


156. Individuals A and B begin to play a
sequence of chess games. Let S = {A wins
a game}, and suppose that outcomes of
successive games are independent with
P(S) = p and P(F) = 1 − p (they never
draw). They will play until one of them
wins ten games. Let X = the number of
games played (with possible values 10, 11,
..., 19).


a. For x=10, 11, ..., 19, obtain an 
expression for p(x) = P(X = x). 

b. If a draw is possible, with p = P(S),
q = P(F), 1 − p − q = P(draw), what
are the possible values of X? What is
P(20 ≤ X)? [Hint: P(20 ≤ X) =
1 − P(X < 20).]


157. A test for the presence of a disease has
probability .20 of giving a false-positive
reading (indicating that an individual has 
the disease when this is not the case) and 
probability .10 of giving a false-negative 
result. Suppose that ten individuals are 
tested, five of whom have the disease and 
five of whom do not. Let X = the number 
of positive readings that result. 


a. Does X have a binomial distribution? 
Explain your reasoning. 

b. What is the probability that exactly 
three of the ten test results are positive? 

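The generalized negative binomial pmf, in 
which r is not necessarily an integer, is 

\[ nb(x; r, p) = k(r, x) \cdot p^{r} (1 - p)^{x}, \qquad x = 0, 1, 2, \ldots \]

where 

\[ k(r, x) = \begin{cases} \dfrac{(x + r - 1)(x + r - 2) \cdots (x + r - x)}{x!} & x = 1, 2, \ldots \\ 1 & x = 0 \end{cases} \]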


Let X, the number of plants of a certain 
species found in a particular region, have 
this distribution with p = .3 and r= 2.5. 
What is P(X = 4)? What is the probability 
that at least one plant is found? 


A small publisher employs two typesetters. 
The number of errors (in one book) made 
by the first typesetter has a Poisson distribution 
with mean μ₁, the number of errors made 
by the second typesetter has a Poisson 
distribution with mean μ₂, and each typesetter 
works on the same number of books. 
Then if one such book is randomly selected, 
the function 

\[ p(x; \mu_1, \mu_2) = .5\, e^{-\mu_1} \frac{\mu_1^{x}}{x!} + .5\, e^{-\mu_2} \frac{\mu_2^{x}}{x!}, \qquad x = 0, 1, 2, \ldots \]


gives the pmf of X = the number of errors 
in the selected book. 

a. Verify that p(x; μ₁, μ₂) is a legitimate 
pmf (≥ 0 and sums to 1). 

b. What is the expected number of errors 
in the selected book? 

c. What is the standard deviation of the 
number of errors in the selected book? 

d. How does the pmf change if the first 
typesetter works on 60% of all such 
books and the second typesetter works 
on the other 40%? 


The mode of a discrete random variable 
X with pmf p(x) is that value x* for which 
p(x) is largest (the most probable x value). 


a. Let X ~ Bin(n, p). By considering the 
ratio b(x + 1; n, p)/b(x; n, p), show that 
b(x; n, p) increases with x as long as 
x < np − (1 − p). Conclude that the 
mode x* is the integer satisfying 
(n + 1)p − 1 ≤ x* ≤ (n + 1)p. 

b. Show that if X has a Poisson distribution 
with parameter μ, the mode is the largest 
integer less than μ. If μ is an integer, 
show that both μ − 1 and μ are 
modes. 


For a particular insurance policy the number 
of claims by a policy holder in 5 years is 
Poisson distributed. If the filing of one claim 
is four times as likely as the filing of 
two claims, find the expected number of 
claims. 


If X is a hypergeometric rv, show directly 
from the definition that E(X) = nM/N (consider 
only the case n < M). [Hint: Factor 
nM/N out of the sum for E(X), and show 
that the terms inside the sum are of the 
form h(y; n − 1, M − 1, N − 1), where y = 
x − 1.] 


Use the fact that 

\[ \sum_{\text{all } x} (x - \mu)^2 p(x) \ge \sum_{x:\, |x - \mu| \ge k\sigma} (x - \mu)^2 p(x) \]

to prove Chebyshev’s inequality, given in 
Exercise 45 of this chapter. 


is characterized by a constant rate λ at 
which events occur per unit time. A generalization 
is to suppose that the probability 
of exactly one event occurring in the 
interval (t, t + Δt) is λ(t) · Δt + o(Δt) for 
some function λ(t). It can then be shown 
that the number of events occurring during 
an interval [t₁, t₂] has a Poisson distribution 
with parameter 

\[ \mu = \int_{t_1}^{t_2} \lambda(t)\, dt \]


The occurrence of events over time in this 
situation is called a nonhomogeneous 
Poisson process. The article “Inference 
Based on Retrospective Ascertainment,” 
J. Amer. Statist. Assoc. 1989: 360-372, 
considers the intensity function 


\[ \lambda(t) = e^{a + bt} \]


as appropriate for events involving transmission 
of HIV via blood transfusions. 
Suppose that a = 2 and b = .6 (close to 
values suggested in the paper), with time in 
years. 


a. What is the expected number of events 
in the interval [0, 4]? In [2, 6]? 

b. What is the probability that at most 15 
events occur in the interval [0, .9907]? 


Suppose a store sells two different coffee 
makers of a particular brand, a basic model 
selling for $30 and a fancy one selling for 
$50. Let X denote the number of people 
among the next 25 purchasing this brand 
who choose the more expensive model. 
Then h(X) = revenue = 50X + 30(25 — X) 
= 20X + 750, a linear function. If the 
choices are independent and have the same 
probability, then how is X distributed? 
Find the mean and standard deviation of 
h(X). Explain why the choices might not 
be independent with the same probability. 


Let X be a discrete rv with possible values 
0, 1, 2, ... or some subset of these. The 
function ψ(s) = E(s^X) = Σ_{x=0}^{∞} s^x · p(x) is 
called the probability generating function 
(pgf) of X. 


a. Suppose X is the number of children 
born to a family, and p(0) = .2, p(1) 
= .5, and p(2) = .3. Determine the pgf 
of X. 

b. Determine the pgf when X has a Poisson 
distribution with parameter μ. 

c. Show that ψ(1) = 1. 

d. Show that ψ′(0) = p(1). (You'll need to 
assume that the derivative can be 
brought inside the summation, which is 
justified.) What results from taking the 
second derivative with respect to s and 
evaluating at s = 0? The third derivative? 
Explain how successive differentiation 
of ψ(s) and evaluation at s = 0 
“generates the probabilities in the distribution.” 
Use this to recapture the 
probabilities of (a) from the pgf. [Note: 
This shows that the pgf contains all the 
information about the distribution: 
knowing ψ(s) is equivalent to knowing 
p(x).] 

Three couples and two single individuals 
have been invited to a dinner party. Assume 
independence of arrivals to the party, and 
suppose that the probability of any particular 
individual or any particular couple 
arriving late is .4 (the two members of a 
couple arrive together). Let X = the number 
of people who show up late for the party. 
Determine the pmf of X. 


Consider a sequence of identical and inde- 
pendent trials, each of which will be a success 
S or failure F. Let p = P(S) and q = P(F). 


a. Define a random variable X as the number 
of trials necessary to obtain the first S, a 
geometric random variable. Here is an 
alternative approach to determining E(X). 
Just as P(B) = P(B|A)P(A) + P(B|A′)P(A′), 
it can be shown that 

\[ E(X) = E(X \mid A)P(A) + E(X \mid A')P(A') \]

where E(X|A) denotes the expected 
value of X given that the event A has 
occurred. Now let A = {S on 1st trial}. 
Show again that E(X) = 1/p. [Hint: 
Denote E(X) by μ. Then given that 
the first trial is a failure, one trial has 
been performed and, starting from the 
second trial, we are still looking for the 
first S. This implies that E(X|A′) = 
E(X|F) = 1 + μ.] 

b. The expected value property in (a) can 
be extended to any partition A₁, A₂, ..., 
A_k of the sample space: 

\[ E(X) = E(X \mid A_1)P(A_1) + E(X \mid A_2)P(A_2) + \cdots + E(X \mid A_k)P(A_k) \]

Now let Y = the number of trials necessary 
to obtain two consecutive S’s. It is 
not possible to determine E(Y) directly 
from the definition of expected value, 
because there is no formula for the pmf of 
Y; the complication is the word consecutive. 
Use the weighted average formula 
to determine E(Y). [Hint: Consider the 
partition with k = 3 and A₁ = {F}, 
A₂ = {SS}, A₃ = {SF}.] 


169. For a discrete rv X taking values in 
{0, 1, 2, 3, ...}, we shall derive the following 
alternative formula for the mean: 

\[ \mu_X = \sum_{x=0}^{\infty} [1 - F(x)] \]

a. Suppose for now the range of X is 
{0, 1, ..., N} for some positive integer 
N. By re-grouping terms, show that 

\[
\begin{array}{rl}
\sum_{x=0}^{N} [x \cdot p(x)] = & p(1) + p(2) + p(3) + \cdots + p(N) \\
 & {} + p(2) + p(3) + \cdots + p(N) \\
 & {} + p(3) + \cdots + p(N) \\
 & \quad \vdots \\
 & {} + p(N)
\end{array}
\]

b. Re-write each row in the above expression 
in terms of the cdf of X, and use this to 
establish that 

\[ \sum_{x=0}^{N} [x \cdot p(x)] = \sum_{x=0}^{N-1} [1 - F(x)] \]

c. Let N → ∞ in part (b) to establish the 
desired result, and explain why the 
resulting formula works even if the maximum 
value of X is finite. [Hint: If the 
largest possible value of X is N, what does 
1 − F(x) equal for x ≥ N?] (This 
derivation also implies that a discrete rv 
X has a finite mean iff the series 
Σ [1 − F(x)] converges.) 


Let X have a geometric distribution with 
parameter p. Use the cdf of X and the 
alternative mean formula just derived to 
determine μ_X. 



Introduction 

As mentioned at the beginning of Chapter 3, the two important types of random variables are discrete 
and continuous. In this chapter, we study the second general type of random variable that arises in 
many applied problems. Sections 4.1 and 4.2 present the basic definitions and properties of continuous 
random variables, their probability distributions, and their various expected values. In Section 4.3, we 
study in detail the normal distribution, arguably the most important and useful in probability and 
statistics. Sections 4.4 and 4.5 discuss some other continuous distributions that are often used in 
applied work. In Section 4.6, we introduce a method for assessing whether given sample data is 
consistent with a specified distribution. Section 4.7 presents methods for obtaining the distribution of a 
rv Y from the distribution of X when the two are related by some equation Y = g(X). The last section is 
dedicated to the simulation of continuous rvs. 


4.1 Probability Density Functions and Cumulative Distribution Functions 


A discrete random variable (rv) is one whose possible values either constitute a finite set or else can 
be listed in an infinite sequence (a list in which there is a first element, a second element, etc.). 
A random variable whose set of possible values is an entire interval of numbers is not discrete. 

Recall from Chapter 3 that a random variable X is continuous if (1) possible values comprise either 
a single interval on the number line (for some A < B, any number x between A and B is a possible 
value) or a union of disjoint intervals, and (2) P(X = c) = 0 for any number c that is a possible value 
of X. 


Example 4.1 If in the study of the ecology of a lake, we make depth measurements at randomly 
chosen locations, then X = the depth at such a location is a continuous rv. Here A is the minimum 
depth in the region being sampled, and B is the maximum depth. ■ 


Example 4.2 If a chemical compound is randomly selected and its pH X is determined, then X is a 
continuous rv because any pH value between 0 and 14 is possible. If more is known about the 
compound selected for analysis, then the set of possible values might be a subinterval of [0, 14], such 
as 5.5 < x < 6.5, but X would still be continuous. ■ 


Example 4.3 Let X represent the amount of time a randomly selected customer spends waiting for a 
haircut before his/her haircut commences. Your first thought might be that X is a continuous random 
variable, since a measurement is required to determine its value. However, there are customers lucky 
enough to have no wait whatsoever before climbing into the barber’s chair. So it must be the case that 
P(X = 0) > 0. Conditional on no chairs being empty, though, the waiting time will be continuous 
since X could then assume any value between some minimum possible time A and a maximum 
possible time B. This random variable is neither purely discrete nor purely continuous but instead is a 
mixture of the two types. ■ 


One might argue that although in principle variables such as height, weight, and temperature are 
continuous, in practice the limitations of our measuring instruments restrict us to a discrete (though 
sometimes very finely subdivided) world. However, continuous models often approximate real-world 
situations very well, and continuous mathematics (the calculus) is frequently easier to work with than 
the mathematics of discrete variables and distributions. 


Probability Distributions for Continuous Variables 

Suppose the variable X of interest is the depth of a lake at a randomly chosen point on the surface. Let 
M = the maximum depth (in meters), so that any number in the interval [0, M] is a possible value of 
X. If we “discretize” X by measuring depth to the nearest meter, then possible values are nonnegative 
integers less than or equal to M. The resulting discrete distribution of depth can be pictured using a 
probability histogram. If we draw the histogram so that the area of the rectangle above any possible 
integer k is the proportion of the lake whose depth is (to the nearest meter) k, then the total area of all 
rectangles is 1. A possible histogram appears in Figure 4.1a. 

If depth is measured much more accurately and the same measurement axis as in Figure 4.1a is 
used, each rectangle in the resulting probability histogram is much narrower, although the total area of 
all rectangles is still 1. A possible histogram is pictured in Figure 4.1b; it has a much smoother 
appearance than that of Figure 4.1a. If we continue in this way to measure depth more and more 
finely, the resulting sequence of histograms approaches a smooth curve, as pictured in Figure 4.1c. 
Because for each histogram the total area of all rectangles equals 1, the total area under the smooth 
curve is also 1. The probability that the depth at a randomly chosen point is between a and b is just the 
area under the smooth curve between a and b. It is exactly a smooth curve of this type that specifies a 
continuous probability distribution. 


Figure 4.1 (a) Probability histogram of depth measured to the nearest meter; (b) probability histogram 
of depth measured to the nearest centimeter; (c) a limit of a sequence of discrete histograms 


DEFINITION Let X be a continuous rv. Then a probability distribution or probability 
density function (pdf) of X is a function f(x) such that for any two numbers 
a and b with a ≤ b, 

\[ P(a \le X \le b) = \int_a^b f(x)\, dx \]

That is, the probability that X takes on a value in the interval [a, b] is the area 
above this interval and under the graph of the density function, as illustrated 
in Figure 4.2. The graph of f(x) is often referred to as the density curve. 


Figure 4.2 P(a ≤ X ≤ b) = the area under the density curve between a and b 

For f(x) to be a legitimate pdf, it must satisfy the following two conditions: 


1. f(x) ≥ 0 for all x 
2. \( \int_{-\infty}^{\infty} f(x)\, dx \) = [area under the entire graph of f(x)] = 1 


The support of a pdf f(x) consists of all x values for which f(x) > 0. Although a pdf is defined for 
−∞ < x < ∞, we will typically display a pdf for the values in its support, and it is always understood 
that f(x) = 0 otherwise. 


Example 4.4 The direction of an imperfection with respect to a reference line on a circular object 
such as a tire, brake rotor, or flywheel is, in general, subject to uncertainty. Consider the reference line 
connecting the valve stem on a tire to the center point, and let X be the angle measured clockwise to 
the location of an imperfection. One possible pdf for X is 


\[ f(x) = \frac{1}{360}, \qquad 0 \le x < 360 \]


The pdf is graphed in Figure 4.3. Clearly f(x) ≥ 0. The area under the density curve is just the area of a 
rectangle: (height)(base) = (1/360)(360) = 1. The probability that the angle is between 90° and 180° is 

\[ P(90 \le X \le 180) = \int_{90}^{180} \frac{1}{360}\, dx = \left. \frac{x}{360} \right|_{x=90}^{x=180} = \frac{1}{4} = .25 \]

Figure 4.3 The pdf and probability for Example 4.4 (shaded area = P(90 ≤ X ≤ 180)) 


The probability that the angle of occurrence is within 90° of the reference line is 

P(0 ≤ X ≤ 90) + P(270 ≤ X < 360) = .25 + .25 = .50 ■ 


Because the pdf in Figure 4.3 is completely “level” (i.e., has a uniform height) on the interval 
[0, 360), X is said to have a uniform distribution. 


DEFINITION A continuous rv X is said to have a uniform distribution on the interval 
[A, B] if the pdf of X is 

\[ f(x; A, B) = \frac{1}{B - A}, \qquad A \le x \le B \]

The statement that X has a uniform distribution on [A, B] will be denoted 
X ~ Unif[A, B]. 


The graph of any uniform pdf looks like the graph in Figure 4.3 except that the interval of positive 
density is [A, B] rather than [0, 360). 
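
As a quick numerical check of the two pdf conditions and of the probabilities in Example 4.4, the following sketch approximates areas under a uniform density with a midpoint Riemann sum. The helper names `uniform_pdf` and `integrate` are illustrative choices, not notation from the text, and the midpoint rule is just one convenient way to approximate the integral.

```python
def uniform_pdf(x, A, B):
    """Density of Unif[A, B]: 1/(B - A) on the interval, 0 elsewhere."""
    return 1.0 / (B - A) if A <= x <= B else 0.0


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def angle_pdf(x):
    """pdf of the imperfection angle in Example 4.4 (uniform on 0 to 360)."""
    return uniform_pdf(x, 0.0, 360.0)


total_area = integrate(angle_pdf, 0.0, 360.0)        # should be 1
p_90_180 = integrate(angle_pdf, 90.0, 180.0)         # should be .25
p_near_ref = (integrate(angle_pdf, 0.0, 90.0)
              + integrate(angle_pdf, 270.0, 360.0))  # should be .50

print(round(total_area, 4), round(p_90_180, 4), round(p_near_ref, 4))
```

For a uniform density the exact answer is simply (interval length)/(B − A), so the numerical integration is only there to mirror the "area under the curve" definition.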

In the discrete case, a probability mass function (pmf) tells us how little “blobs” of probability 
mass of various magnitudes are distributed along the measurement axis. In the continuous case, 
probability density is “smeared” in a continuous fashion along the interval of possible values. When 
density is smeared uniformly over the interval, a uniform pdf, as in Figure 4.3, results. 

When X is a discrete random variable, each possible value is assigned positive probability. This is 
not true of a continuous random variable, because the area under a density curve that lies above any 
single value is zero: 


\[ P(X = c) = P(c \le X \le c) = \int_c^c f(x)\, dx = 0 \]


The fact that P(X = c) = 0 when X is continuous has an important practical consequence: The 
probability that X lies in some interval between a and b does not depend on whether the lower limit 
a or the upper limit b is included in the probability calculation: 


P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b) (4.1) 

In contrast, if X were discrete and both a and b were possible values of X (e.g., X ~ Bin(20, .3) and 
a = 5, b = 10), then all four of the probabilities in (4.1) would be different. This also means that 
whether we include the endpoints of the range of values for a continuous rv X is somewhat arbitrary; 
for example, the pdf in Example 4.4 could be defined to be positive on (0, 360) or [0, 360] rather than 
[0, 360), and the same applies for a uniform distribution on [A, B] in general. 
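
To see the discrete contrast concretely, the sketch below evaluates the four interval probabilities for X ~ Bin(20, .3) with a = 5 and b = 10, the case mentioned above. The function name `binom_pmf` and the variable names are illustrative; only the standard-library `math.comb` is used.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial pmf b(x; n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p, a, b = 20, 0.3, 5, 10

# Each probability sums the pmf over a different set of integers.
closed = sum(binom_pmf(x, n, p) for x in range(a, b + 1))          # P(a <= X <= b)
left_open = sum(binom_pmf(x, n, p) for x in range(a + 1, b + 1))   # P(a <  X <= b)
right_open = sum(binom_pmf(x, n, p) for x in range(a, b))          # P(a <= X <  b)
open_both = sum(binom_pmf(x, n, p) for x in range(a + 1, b))       # P(a <  X <  b)

for label, value in [("a<=X<=b", closed), ("a<X<=b", left_open),
                     ("a<=X<b", right_open), ("a<X<b", open_both)]:
    print(f"P({label}) = {value:.4f}")
```

The four values differ exactly by the point masses P(X = 5) and P(X = 10), which is what disappears in the continuous case.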

The zero probability condition has a physical analog. Consider a solid circular rod (with cross-sectional 
area of 1 in² for simplicity). Place the rod alongside a measurement axis and suppose that 
the density of the rod at any point x is given by the value f(x) of a density function. Then if the rod is 
sliced at points a and b and this segment is removed, the amount of mass removed is \( \int_a^b f(x)\, dx \); 
however, if the rod is sliced just at the point c, no mass is removed. Mass is assigned to interval 
segments of the rod but not to individual points. 

So, if P(X = c) = 0 when X is a continuous rv, then what does f(c) represent? After all, if X were 
discrete, its pmf evaluated at x = c, p(c), would indicate the probability that X equals c. To help 
understand what f(c) means, consider a small window near x = c, say [c, c + Δx]. Using a rectangle 
to approximate the area under f(x) between c and c + Δx (the usual “Riemann approximation” idea 
from calculus), one obtains \( \int_c^{c+\Delta x} f(x)\, dx \approx \Delta x \cdot f(c) \), from which 

\[ f(c) \approx \frac{\int_c^{c+\Delta x} f(x)\, dx}{\Delta x} = \frac{P(c \le X \le c + \Delta x)}{\Delta x} \]


This indicates that f(c) is not a probability, but rather roughly the probability of an interval divided by 
the length of the chosen interval. If we associate mass with probability and remember that interval 
length is the one-dimensional analogue of volume, then f represents their quotient, mass per volume, 
more commonly known as density (hence, the name pdf). The height of the function f(x) at a 
particular point reflects how “dense” the values of X are near that point—taller sections of f(x) contain 
more probability within a fixed interval length than do shorter sections. 
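
The "probability per unit length" reading of f(c) can be checked numerically. The sketch below uses an assumed illustrative density, f(x) = e^(−x) for x ≥ 0 (chosen only because its interval probabilities have a simple closed form; it is not one of the examples in this section), and shows the ratio P(c ≤ X ≤ c + Δx)/Δx approaching f(c) as Δx shrinks.

```python
from math import exp


def f(x):
    """Assumed illustrative pdf: f(x) = e^(-x) for x >= 0, 0 otherwise."""
    return exp(-x) if x >= 0 else 0.0


def interval_prob(a, b):
    """Exact P(a <= X <= b) for the pdf above, valid for 0 <= a <= b."""
    return exp(-a) - exp(-b)


c = 1.0
ratios = []
for dx in (0.5, 0.1, 0.01, 0.001):
    ratio = interval_prob(c, c + dx) / dx   # probability per unit length
    ratios.append(ratio)
    print(f"dx = {dx:<6} ratio = {ratio:.5f}   f(c) = {f(c):.5f}")
```

The printed ratios move steadily toward f(1), illustrating that the density is a limit of probability divided by interval length rather than a probability itself.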


Example 4.5 Climate change has made effective modeling and management of floodwaters ever 
more important in coastal areas. One variable of particular importance is the flow rate of water above 
some minimum threshold (typically where the rate becomes hazardous and requires intervention). The 
following pdf of X = hazardous flood rate (m³/s) is suggested under certain conditions by the article 
“A Framework for Probabilistic Assessment of Clear-Water Scour Around Bridge Piers” (Structural 
Safety 2017: 11-22): 


\[ f(x) = .04e^{-.04(x - 10)}, \qquad x \ge 10 \]


The graph of f(x) is given in Figure 4.4; there is no density associated with flow rates below 10 m³/s, 
because such flow rates are deemed nonhazardous under these particular conditions. The flow rate 
density decreases rapidly (exponentially fast) as x increases from 10. Clearly f(x) ≥ 0; to show that 
\( \int_{-\infty}^{\infty} f(x)\, dx = 1 \), we use calculus: 

\[ \int_{-\infty}^{\infty} f(x)\, dx = \int_{-\infty}^{10} 0\, dx + \int_{10}^{\infty} .04e^{-.04(x-10)}\, dx = .04e^{.4} \int_{10}^{\infty} e^{-.04x}\, dx = .04e^{.4} \cdot \left. \frac{e^{-.04x}}{-.04} \right|_{10}^{\infty} = 0 - \left( -e^{.4} e^{-.04(10)} \right) = 1 \]

Figure 4.4 The density curve for flood rate in Example 4.5 


According to this model, the probability that flood rate is at most 50 m³/s is 


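\[ P(X \le 50) = \int_{-\infty}^{50} f(x)\, dx = \int_{10}^{50} .04e^{-.04(x-10)}\, dx = .04e^{.4} \int_{10}^{50} e^{-.04x}\, dx = .04e^{.4} \cdot \left. \frac{e^{-.04x}}{-.04} \right|_{10}^{50} = e^{.4} \left( -e^{-.04(50)} + e^{-.04(10)} \right) = .798 \]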


Similarly, the probability the flood rate hits at least 200 m³/s, the point at which a nearby bridge will 
collapse, is 

\[ P(X \ge 200) = \int_{200}^{\infty} .04e^{-.04(x-10)}\, dx = .0005 \]

Since X is a continuous rv, .0005 also equals P(X > 200), the probability that the flood rate exceeds 
200 m³/s. The difference between these two events is {X = 200}, i.e., that flood rate is exactly 200, 
which has probability zero: \( P(X = 200) = \int_{200}^{200} f(x)\, dx = 0 \). 

This last statement may feel uncomfortable to you: Is there really zero chance that the flood rate is 
exactly 200 m³/s? If flow rate is treated as continuous, then “exactly 200” means X = 200.000..., 
with an endless repetition of 0s. That is to say, X is not rounded to the nearest tenth or even 
hundredth; we are asking for the probability that X equals one specific number, 200.000..., out of the 
(uncountably) infinite collection of possible values of X. ■ 
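
The calculations in Example 4.5 can be reproduced numerically. This sketch assumes the flood-rate pdf f(x) = .04e^(−.04(x−10)) for x ≥ 10; the function names and the midpoint-rule integrator are illustrative helpers, not part of the cited article or the text.

```python
from math import exp


def flood_pdf(x):
    """Assumed pdf from Example 4.5: .04*exp(-.04*(x - 10)) for x >= 10."""
    return 0.04 * exp(-0.04 * (x - 10)) if x >= 10 else 0.0


def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# P(X <= 50): area under the density from 10 to 50.
p_at_most_50 = integrate(flood_pdf, 10.0, 50.0)

# P(X >= 200): the upper tail has the closed form exp(-.04*(200 - 10)).
p_at_least_200 = exp(-0.04 * (200 - 10))

# A single point carries no area, so P(X = 200) is an integral over [200, 200].
p_exactly_200 = integrate(flood_pdf, 200.0, 200.0)

print(round(p_at_most_50, 3), round(p_at_least_200, 4), p_exactly_200)
```

The first two values agree with the .798 and .0005 reported in the example, and the zero-width integral returns 0, which is the numerical counterpart of P(X = 200) = 0.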


Unlike discrete distributions such as the binomial, hypergeometric, and negative binomial, the 
distribution of any given continuous rv cannot usually be derived using simple probabilistic argu- 
ments (with a few notable exceptions). Instead, one must make a judicious choice of pdf based on 
prior knowledge and available data. Fortunately, some general pdf families have been found to fit well 
in a wide variety of experimental situations; several of these are discussed later in the chapter. 

Just as in the discrete case, it is often helpful to think of the population of interest as consisting of 
X values rather than individuals or objects. The pdf is then a model for the distribution of values in 
this numerical population, and from this model various population characteristics (such as the mean) 
can be calculated. 

Several of the most important concepts introduced in the study of discrete distributions also play 
an important role for continuous distributions. Definitions analogous to those in Chapter 3 involve 
replacing summation by integration. 


The Cumulative Distribution Function 

The cumulative distribution function (cdf) F(x) for a discrete rv X gives, for any specified number x, 
the probability P(X ≤ x). It is obtained by summing the pmf p(y) over all possible values y satisfying 
y ≤ x. The cdf of a continuous rv gives the same probabilities P(X ≤ x) and is obtained by 
integrating the pdf f(y) between the limits −∞ and x. 


DEFINITION The cumulative distribution function F(x) for a continuous rv X is defined 
for every number x by 

\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\, dy \]

For each x, F(x) is the area under the density curve to the left of x. This is illustrated in Figure 4.5, 
where F(x) increases smoothly as x increases. 

Figure 4.5 A pdf and associated cdf 


Example 4.6 Let X, the thickness of a membrane, have a uniform distribution on [A, B]. The density 
function is shown in Figure 4.6. For x < A, F(x) = 0, since there is no area under the graph of the 
density function to the left of such an x. For x ≥ B, F(x) = 1, since all the area is accumulated to the 
left of such an x. Finally, for A ≤ x < B, 

\[ F(x) = \int_{-\infty}^{x} f(y)\, dy = \int_{A}^{x} \frac{1}{B - A}\, dy = \frac{x - A}{B - A} \]

Figure 4.6 The pdf for a uniform distribution 

The entire cdf is 

\[ F(x) = \begin{cases} 0 & x < A \\ \dfrac{x - A}{B - A} & A \le x < B \\ 1 & x \ge B \end{cases} \]

The graph of this cdf appears in Figure 4.7. 

Figure 4.7 The cdf for a uniform distribution ■ 
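
The three-piece uniform cdf translates directly into a small function. In this sketch the endpoints A = 2 and B = 5 are made-up values used only for illustration, and `uniform_cdf` is a hypothetical helper name.

```python
def uniform_cdf(x, A, B):
    """cdf of Unif[A, B]: 0 left of A, (x - A)/(B - A) inside, 1 from B onward."""
    if x < A:
        return 0.0
    if x >= B:
        return 1.0
    return (x - A) / (B - A)


A, B = 2.0, 5.0  # illustrative endpoints, not taken from the text

# Probability of an interval as a difference of two cdf values.
p_3_to_4 = uniform_cdf(4.0, A, B) - uniform_cdf(3.0, A, B)
print(uniform_cdf(1.0, A, B), uniform_cdf(3.5, A, B), uniform_cdf(6.0, A, B))
print(p_3_to_4)
```

Because the density is flat, the interval probability is just (4 − 3)/(5 − 2), the interval length divided by the length of the support.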


Using F(x) to Compute Probabilities 
The importance of the cdf here, just as for discrete rvs, is that probabilities of various intervals can be 
computed from a formula or table for F(x). 


PROPOSITION Let X be a continuous rv with pdf f(x) and cdf F(x). Then for any number a, 

P(X > a) = 1 − F(a) 

and for any two numbers a and b with a < b, 

P(a ≤ X ≤ b) = F(b) − F(a) 

Figure 4.8 illustrates the second part of this proposition; the desired probability is the shaded area 
under the density curve between a and b, and it equals the difference between the two shaded 
cumulative areas. This is different from what is appropriate for a discrete integer-valued rv (e.g., 
binomial or Poisson): P(a ≤ X ≤ b) = F(b) − F(a − 1) when a and b are integers. 

Figure 4.8 Computing P(a ≤ X ≤ b) from cumulative probabilities 


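Example 4.7 Suppose the pdf of the magnitude X of a dynamic load on a bridge (in newtons) is 
given by 

\[ f(x) = \frac{1}{8} + \frac{3}{8}x, \qquad 0 \le x \le 2 \]

For any number x between 0 and 2, 

\[ F(x) = \int_{-\infty}^{x} f(y)\, dy = \int_{0}^{x} \left( \frac{1}{8} + \frac{3}{8}y \right) dy = \frac{x}{8} + \frac{3x^2}{16} \]

Thus 

\[ F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x}{8} + \dfrac{3x^2}{16} & 0 \le x \le 2 \\ 1 & 2 < x \end{cases} \]

The graphs of f(x) and F(x) are shown in Figure 4.9. The probability that the load is between 1 and 
1.5 N is 

\[ P(1 \le X \le 1.5) = F(1.5) - F(1) = \left[ \frac{1}{8}(1.5) + \frac{3}{16}(1.5)^2 \right] - \left[ \frac{1}{8}(1) + \frac{3}{16}(1)^2 \right] = \frac{19}{64} = .297 \]

Figure 4.9 The pdf and cdf for Example 4.7 

The probability that the load exceeds 1 N is 

\[ P(X > 1) = 1 - P(X \le 1) = 1 - F(1) = 1 - \left[ \frac{1}{8}(1) + \frac{3}{16}(1)^2 \right] = \frac{11}{16} = .688 \] ■ 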


The beauty of the cdf in the continuous case is that once it is available, any probability involving 
X can easily be calculated without any further integration. 
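
A short sketch of this idea for Example 4.7, assuming the cdf F(x) = x/8 + 3x²/16 on [0, 2] (0 below and 1 above that interval); `load_cdf` is an illustrative name. Once the cdf is coded, each probability is a one-line difference with no integration.

```python
def load_cdf(x):
    """Assumed cdf from Example 4.7: x/8 + 3x^2/16 on [0, 2]."""
    if x < 0:
        return 0.0
    if x > 2:
        return 1.0
    return x / 8 + 3 * x**2 / 16


p_between = load_cdf(1.5) - load_cdf(1.0)   # P(1 <= X <= 1.5)
p_exceeds_1 = 1 - load_cdf(1.0)             # P(X > 1)

print(round(p_between, 3), round(p_exceeds_1, 3))
```

The exact values are 19/64 and 11/16, which round to the .297 and .688 quoted in the example.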


Obtaining f(x) from F(x) 

For X discrete, the pmf is obtained from the cdf by taking the difference between two F(x) values. The 
continuous analog of a difference is a derivative. The following result is a consequence of the 
Fundamental Theorem of Calculus. 


PROPOSITION If X is a continuous rv with pdf f(x) and cdf F(x), then at every x at 
which the derivative F’(x) exists, F’(x) = f(x). 


Example 4.8 (Example 4.7 continued) The cdf in Example 4.7 is differentiable except at x = 0 and 
x = 2, where the graph of F(x) has sharp corners. Since F(x) = 0 for x < 0 and F(x) = 1 for x > 2, 
F′(x) = 0 = f(x) for such x. For 0 < x < 2, 

\[ F'(x) = \frac{d}{dx}\left( \frac{x}{8} + \frac{3x^2}{16} \right) = \frac{1}{8} + \frac{3}{8}x = f(x) \] ■ 
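
The relationship F′(x) = f(x) can also be checked numerically. The sketch below assumes the Example 4.7 functions F(x) = x/8 + 3x²/16 and f(x) = 1/8 + 3x/8 on [0, 2], and compares a central-difference approximation of the derivative of F with f at a few interior points. The helper names and the step size h are illustrative choices.

```python
def load_cdf(x):
    """Assumed cdf from Example 4.7."""
    if x < 0:
        return 0.0
    if x > 2:
        return 1.0
    return x / 8 + 3 * x**2 / 16


def load_pdf(x):
    """Assumed pdf from Example 4.7."""
    return 1 / 8 + 3 * x / 8 if 0 <= x <= 2 else 0.0


def numeric_derivative(F, x, h=1e-6):
    """Central-difference approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


max_gap = 0.0
for x in (0.5, 1.0, 1.5):
    slope = numeric_derivative(load_cdf, x)
    max_gap = max(max_gap, abs(slope - load_pdf(x)))
    print(f"x = {x}: F'(x) ~ {slope:.6f}, f(x) = {load_pdf(x):.6f}")
```

The check is restricted to interior points because F has corners at x = 0 and x = 2, where the derivative does not exist.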


Percentiles of a Continuous Distribution 

When we say that an individual’s test score was at the 85th percentile of the population, we mean that 
85% of all population scores were below that score and 15% were above. Similarly, the 40th 
percentile is the score that exceeds 40% of all scores and is exceeded by 60% of all scores. 


DEFINITION Let p be a number between 0 and 1. The (100p)th percentile (equivalently, the 
pth quantile) of the distribution of a continuous rv X, denoted by \( \eta_p \), is defined by 

\[ p = F(\eta_p) = \int_{-\infty}^{\eta_p} f(y)\, dy \qquad (4.2) \]

Assuming we can find the inverse of F(x), this can also be written as 

\[ \eta_p = F^{-1}(p) \]

In particular, the median of a continuous distribution is the 50th percentile, \( \eta_{.5} \) 
or \( F^{-1}(.5) \). That is, half the area under the density curve is to the left of the 
median and half is to the right of the median. We will also denote the median 
of a distribution by \( \tilde{\mu} \). 

According to Expression (4.2), \( \eta_p \) is that value on the measurement axis such that 100p% of the area 
under the graph of f(x) lies to the left of \( \eta_p \) and 100(1 − p)% lies to the right. Thus \( \eta_{.75} \), the 75th 
percentile, is such that the area under the graph of f(x) to the left of \( \eta_{.75} \) is .75. Figure 4.10 illustrates 
the definition. 

Figure 4.10 The (100p)th percentile of a continuous distribution 


Example 4.9 The distribution of the amount of gravel (in tons) sold by a construction supply 
company in a given week is a continuous rv X with pdf 

\[ f(x) = \frac{3}{2}\left( 1 - x^2 \right), \qquad 0 \le x \le 1 \]

The cdf of sales for any x between 0 and 1 is 

\[ F(x) = \int_0^x \frac{3}{2}\left( 1 - y^2 \right) dy = \frac{3}{2}\left( x - \frac{x^3}{3} \right) \]

The graphs of both f(x) and F(x) appear in Figure 4.11. 

Figure 4.11 The pdf and cdf for Example 4.9 


The (100p)th percentile of this distribution satisfies the equation 

\[ p = F(\eta_p) = \frac{3}{2}\left( \eta_p - \frac{\eta_p^3}{3} \right) \]

that is, 

\[ \eta_p^3 - 3\eta_p + 2p = 0 \]

For the median \( \tilde{\mu} = \eta_{.5} \), p = .5 and the equation to be solved is \( \tilde{\mu}^3 - 3\tilde{\mu} + 1 = 0 \); the solution is 
\( \tilde{\mu} = .347 \). If the distribution remains the same from week to week, then in the long run 50% of all 
weeks will result in sales of less than .347 tons and 50% in more than .347 tons. ■ 
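
When the percentile equation has no convenient closed-form solution, as with the cubic in Example 4.9, it can be solved numerically. The sketch below assumes the cdf F(x) = (3/2)(x − x³/3) on [0, 1] and uses bisection, which works because a cdf is nondecreasing. Bisection is one reasonable choice here, not necessarily the method used in the text, and the function names are illustrative.

```python
def gravel_cdf(x):
    """Assumed cdf from Example 4.9: 1.5*(x - x^3/3) on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 1.5 * (x - x**3 / 3)


def percentile(p, lo=0.0, hi=1.0, tol=1e-10):
    """Solve F(eta) = p by bisection on [lo, hi]; requires 0 < p < 1."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gravel_cdf(mid) < p:
            lo = mid      # too little area to the left: move right
        else:
            hi = mid      # enough area to the left: move left
    return (lo + hi) / 2


median = percentile(0.5)
third_quartile = percentile(0.75)
print(round(median, 3), round(third_quartile, 3))
```

The computed median matches the .347 tons found in the example, and the same function returns any other percentile by changing p.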


A continuous distribution whose pdf is symmetric (meaning that the graph of the pdf to the 
left of some point is a mirror image of the graph to the right of that point) has median \( \tilde{\mu} \) equal to the 
point of symmetry, since half the area under the curve lies to either side of this point. Figure 4.12 
gives several examples. The amount of error in a measurement of a physical quantity is often assumed 
to have a symmetric distribution. 


Figure 4.12 Medians of symmetric distributions 


Exercises: Section 4.1 (1-17) 


1. Let X denote the amount of time for which 
a book on 2-h reserve at a college library is 
checked out by a randomly selected student 
and suppose that X has density function 

\[ f(x) = .5x, \qquad 0 \le x \le 2 \]

Calculate the following probabilities: 

a. P(X < 1) 
b. P(.5 < X < 1.5) 
c. P(.5 < X) 


2. Suppose the reaction temperature X (in °C) 
in a chemical process has a uniform distribution 
with A = −5 and B = 5. 

a. Compute P(X < 0). 

b. Compute P(−2.5 < X < 2.5). 

c. Compute P(−2 < X < 3). 

d. For k satisfying −5 < k < k + 4 < 5, 
compute P(k < X < k + 4). Interpret this 
in words. 


3. Suppose the error involved in making a 
measurement is a continuous rv X with pdf 

\[ f(x) = .09375\left( 4 - x^2 \right), \qquad -2 \le x \le 2 \]

a. Sketch the graph of f(x). 
b. Compute P(X > 0). 
c. Compute P(−1 < X < 1). 
d. Compute P(X < −.5 or X > .5). 


4. Let X denote the power (MW) generated by 
a wind turbine at a given wind speed. The 
article “An Investigation of Wind Power 
Density Distribution at Location With Low 
and High Wind Speeds Using Statistical 
Model” (Appl. Energy 2018: 442-451) 
proposes the Rayleigh distribution, with pdf 

\[ f(x; \theta) = \frac{x}{\theta^2}\, e^{-x^2/(2\theta^2)}, \qquad x > 0 \]

as a model for the X distribution. The value 
of the parameter θ depends upon the 
prevailing wind speed. 

a. Verify that f(x; θ) is a legitimate pdf. 

b. Suppose θ = 100. What is the probability 
that X is at most 200? Less than 
200? At least 200? 

c. What is the probability that X is between 
100 and 200 (again assuming θ = 100)? 

d. Give an expression for the cdf of X. 


5. A college professor never finishes his lecture 
before the end of the hour and always 
finishes his lecture within 2 min after the 
hour. Let X = the time that elapses between 
the end of the hour and the end of the 
lecture and suppose the pdf of X is 

\[ f(x) = kx^2, \qquad 0 \le x \le 2 \]

a. Find the value of k. [Hint: Total area 
under the graph of f(x) is 1.] 

b. What is the probability that the lecture 
ends within 1 min of the end of the 
hour? 

c. What is the probability that the lecture 
continues beyond the hour for between 
60 and 90 s? 

d. What is the probability that the lecture 
continues for at least 90 s beyond the 
end of the hour? 


6. The grade point averages (GPAs) for 
graduating seniors at a college are distributed 
as a continuous rv X with pdf 

\[ f(x) = k\left[ 1 - (x - 3)^2 \right], \qquad 2 \le x \le 4 \]

a. Sketch the graph of f(x). 

b. Find the value of k. 

c. Find the probability that a GPA exceeds 
3. 

d. Find the probability that a GPA is within 
.25 of 3. 

e. Find the probability that a GPA differs 
from 3 by more than .5. 

7. The time X (min) for a laboratory assistant to prepare the equipment for a certain
experiment is believed to have a uniform distribution with A = 25 and B = 35.


a. Write the pdf of X and sketch its graph.

b. What is the probability that preparation time exceeds 33 min?

c. What is the probability that preparation time is within 2 min of the median time?
[Hint: Identify the median $\tilde{\mu}$ from the graph of f(x).]

d. For any a such that 25 < a < a + 2 < 35, what is the probability that preparation
time is between a and a + 2 min?


8. Commuting to work requires getting on a bus near home and then transferring to a
second bus. If the waiting time (in minutes) at each stop has a uniform distribution with
A = 0 and B = 5, then it can be shown that the total waiting time Y has the pdf


\[ f(y) = \begin{cases} \dfrac{y}{25} & 0 \le y < 5 \\[4pt] \dfrac{2}{5} - \dfrac{y}{25} & 5 \le y \le 10 \end{cases} \]


a. Sketch the pdf of Y.

b. Verify that $\int_{-\infty}^{\infty} f(y)\,dy = 1$.

c. What is the probability that total waiting 
time is at most 3 min? 

d. What is the probability that total waiting 
time is at most 8 min? 

e. What is the probability that total waiting 
time is between 3 and 8 min? 

f. What is the probability that total waiting 
time is either less than 2 min or more 
than 6 min? 


9. Consider again the rv X = hazardous flood rate given in Example 4.5. What is the
probability that the flood rate is


a. At most 40 m³/s?
b. More than 40 m³/s? At least 40 m³/s?
c. Between 40 and 60 m³/s?


10. A family of pdfs that has been used to approximate the distribution of income, city
population size, and size of firms is the Pareto family. The family has two parameters,
k and $\theta$, both > 0, and the pdf is


\[ f(x; k, \theta) = \frac{k \cdot \theta^k}{x^{k+1}}, \quad x \ge \theta \]


a. Sketch the graph of $f(x; k, \theta)$.

b. Verify that the total area under the graph equals 1.

c. If the rv X has pdf $f(x; k, \theta)$, for any fixed $b > \theta$, obtain an expression for $P(X \le b)$.

d. For $\theta < a < b$, obtain an expression for the probability $P(a \le X \le b)$.

e. Find an expression for the (100p)th percentile $\eta_p$.


11. The cdf of checkout duration X as described in Exercise 1 is


\[ F(x) = \begin{cases} 0 & x < 0 \\ x^2/4 & 0 \le x < 2 \\ 1 & 2 \le x \end{cases} \]


Use this to compute the following:


a. $P(X \le 1)$

b. $P(.5 \le X \le 1)$

c. $P(X > .5)$

d. The median checkout duration [Hint: Solve $F(\tilde{\mu}) = .5$.]

e. $F'(x)$ to obtain the density function f(x)

12. The cdf for X = measurement error of Exercise 3 is


\[ F(x) = \begin{cases} 0 & x < -2 \\[2pt] \dfrac{1}{2} + \dfrac{3}{32}\left(4x - \dfrac{x^3}{3}\right) & -2 \le x < 2 \\[2pt] 1 & 2 \le x \end{cases} \]

a. Compute P(X < 0).
b. Compute P(-1 < X < 1).
c. Compute P(.5 < X).
d. Verify that f(x) is as given in Exercise 3 by obtaining $F'(x)$.
e. Verify that $\tilde{\mu} = 0$.


13. Suppose that in a certain traffic environment, the distribution of X = the time
headway (sec) between two randomly selected consecutive cars has the form


a. Determine the value of k for which f(x) is a legitimate pdf.

b. Obtain the cumulative distribution function.

c. Use the cdf from (b) to determine the probability that headway exceeds 2 s
and also the probability that headway is between 2 and 3 s.


14. Let X denote the amount of space occupied by an article placed in a 1-ft³ packing
container. The pdf of X is


\[ f(x) = 90x^8(1 - x), \quad 0 < x < 1 \]


a. Graph the pdf. Then obtain the cdf of X and graph it.

b. What is $P(X \le .5)$ [i.e., F(.5)]?

c. Using part (a), what is $P(.25 < X \le .5)$? What is $P(.25 \le X \le .5)$?

d. What is the 75th percentile of the distribution?


15. Answer parts (a)-(d) of Exercise 14 for the random variable X, lecture time past the
hour, given in Exercise 5.

16. Let X be a continuous rv with cdf


\[ F(x) = \begin{cases} 0 & x \le 0 \\[2pt] \dfrac{x}{4}\left[1 + \ln\left(\dfrac{4}{x}\right)\right] & 0 < x \le 4 \\[2pt] 1 & x > 4 \end{cases} \]


[This type of cdf is suggested in the article “Variability in Measured Bedload-Transport
Rates” (Water Resources Bull. 1985:39-48) as a model for a hydrologic variable.] What is


a. $P(X \le 1)$?
b. $P(1 \le X \le 3)$?
c. The pdf of X?


17. Let X be the temperature in °C at which a chemical reaction takes place, and let Y be
the temperature in °F (so Y = 1.8X + 32).


a. If the median of the X distribution is $\tilde{\mu}$, show that $1.8\tilde{\mu} + 32$ is the
median of the Y distribution.

b. How is the 90th percentile of the Y distribution related to the 90th percentile of
the X distribution? Verify your conjecture.

c. More generally, if Y = aX + b, how is any particular percentile of the Y distribution
related to the corresponding percentile of the X distribution?
In Section 4.1 we saw that the transition from a discrete cdf to a continuous cdf entails replacing
summation by integration. The same thing is true in moving from expected values and mgfs of 
discrete variables to those of continuous variables. 


Expected Values 

For a discrete random variable X, E(X) was obtained by summing x · p(x) over possible X values.
Here we replace summation by integration and the pmf by the pdf to get a continuous weighted
average.


DEFINITION The expected or mean value of a continuous rv X with pdf f(x) is

\[ \mu_X = E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \]

This expected value will exist provided that $\int_{-\infty}^{\infty} |x| f(x)\,dx < \infty$. In practice, the
limits of integration are specified by the support of the pdf (since f(x) = 0
otherwise).


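Example 4.10 (Example 4.9 continued) The pdf of weekly gravel sales X was

\[ f(x) = \frac{3}{2}(1 - x^2), \quad 0 \le x \le 1 \]

so

\[ E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx = \int_0^1 x \cdot \frac{3}{2}(1 - x^2)\,dx = \frac{3}{2}\int_0^1 (x - x^3)\,dx = \frac{3}{2}\left(\frac{x^2}{2} - \frac{x^4}{4}\right)\Bigg|_{x=0}^{x=1} = \frac{3}{8} = .375 \]

If gravel sales are determined week after week according to the given pdf, then the long-run average
value of sales per week will be .375 ton. ■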
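The integral in Example 4.10 can also be checked numerically. The sketch below is written in Python rather than the R used elsewhere in the text; the helper name `simpson` and the number of subintervals are illustrative choices, not part of the original example. It approximates both the total area under the pdf and the mean.

```python
def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def pdf(x):
    """pdf of weekly gravel sales from Example 4.10, supported on [0, 1]."""
    return 1.5 * (1 - x ** 2)


total_area = simpson(pdf, 0.0, 1.0)              # should be close to 1
mean = simpson(lambda x: x * pdf(x), 0.0, 1.0)   # should be close to 3/8

print(f"area = {total_area:.6f}, E(X) = {mean:.6f}")
```

Only the support [0, 1] is integrated, since the pdf is 0 elsewhere.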


Similar to the interpretation in the discrete case, the mean value $\mu$ can be regarded as the balance
point (or fulcrum or center of mass) of a continuous distribution. In Example 4.10, if a piece of
cardboard was cut out in the shape of the region under the density curve f(x), then it would balance if
supported at $\mu = 3/8$ along the bottom edge. When a pdf f(x) is symmetric, then it will balance at its
point of symmetry, which must be the mean $\mu$ (assuming $\mu$ exists). Recall from Section 4.1 that the
median is also the point of symmetry; in general, if a distribution is symmetric and the mean exists,
then it is equal to the median.

Often we wish to compute the expected value of some function h(X) of the rv X. If we think of
h(X) as a new rv Y, methods from Section 4.7 can be used to derive the pdf of Y, and E(Y) can be
computed from the definition. Fortunately, as in the discrete case, there is an easier way to compute
E[h(X)].


LAW OF THE UNCONSCIOUS STATISTICIAN If X is a continuous rv with pdf f(x) and h(X) is any
function of X, then

\[ \mu_{h(X)} = E[h(X)] = \int_{-\infty}^{\infty} h(x) \cdot f(x)\,dx \]

This expected value will exist provided that $\int_{-\infty}^{\infty} |h(x)| f(x)\,dx < \infty$.


Importantly, except in the cases where h(x) is a linear function (see later in this section), E[h(X)] is not
equal to $h(\mu_X)$, the function h evaluated at the mean of X.


Example 4.11 The variation in a certain electrical current source X (in milliamps) can be modeled
by the pdf

\[ f(x) = 1.25 - .25x, \quad 2 \le x \le 4 \]

The average current from this source is

\[ \mu_X = E(X) = \int_2^4 x(1.25 - .25x)\,dx = \frac{17}{6} = 2.833 \text{ mA} \]

If this current passes through a 220-ohm resistor, the resulting power (in microwatts) is given by the
expression h(X) = (current)²(resistance) = 220X². The expected power is given by

\[ E[h(X)] = E(220X^2) = \int_2^4 220x^2(1.25 - .25x)\,dx = \frac{5500}{3} = 1833.3 \text{ microwatts} \]

Notice that the expected power is not equal to 220(2.833)², a common error that results from
substituting the mean current $\mu_X$ into the power formula. ■
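To make the warning in Example 4.11 concrete, the following Python sketch (an illustration only; the helper `simpson` and the variable names are assumptions introduced here) computes E(220X²) by integrating h(x)·f(x) and compares it with the incorrect value 220·μ².

```python
def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def pdf(x):
    """pdf of the current X (mA) from Example 4.11, supported on [2, 4]."""
    return 1.25 - 0.25 * x


mean_current = simpson(lambda x: x * pdf(x), 2.0, 4.0)
expected_power = simpson(lambda x: 220 * x ** 2 * pdf(x), 2.0, 4.0)  # E[h(X)]
naive_power = 220 * mean_current ** 2                                # h(E[X])

print(f"E(X) = {mean_current:.3f} mA")
print(f"E(220 X^2) = {expected_power:.1f}, but 220 * E(X)^2 = {naive_power:.1f}")
```

The gap between the two numbers equals 220·V(X), which is why plugging the mean into a nonlinear function understates the expected power here.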


Example 4.12 Two species are competing in a region for control of a limited amount of a resource.
Let X = the proportion of the resource controlled by species 1 and suppose X has pdf

\[ f(x) = 1, \quad 0 \le x \le 1 \]

which is the uniform distribution on [0, 1]. (In her book Ecological Diversity, E. C. Pielou calls this
the “broken-stick” model for resource allocation, since it is analogous to breaking a stick at a
randomly chosen point.) Then the species that controls the majority of this resource controls the
amount

\[ h(X) = \max(X, 1 - X) = \begin{cases} 1 - X & \text{if } 0 \le X < .5 \\ X & \text{if } .5 \le X \le 1 \end{cases} \]

The expected amount controlled by the species having majority control is then

\[ E[h(X)] = \int_{-\infty}^{\infty} \max(x, 1 - x) \cdot f(x)\,dx = \int_0^1 \max(x, 1 - x) \cdot 1\,dx = \int_0^{.5} (1 - x)\,dx + \int_{.5}^{1} x\,dx = \frac{3}{4} \]

■

In the discrete case, the variance of X was defined as the expected squared deviation from $\mu$ and was
calculated by summation. Here again integration replaces summation.


DEFINITION The variance of a continuous random variable X with pdf f(x) and mean
value $\mu$ is

\[ \sigma_X^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx = E[(X - \mu)^2] \]

The standard deviation of X is $SD(X) = \sigma_X = \sqrt{V(X)}$.


As in the discrete case, $\sigma_X^2$ is the expected or average squared deviation about the mean $\mu$, and $\sigma_X$
can be interpreted roughly as the size of a representative deviation from the mean value $\mu$.


Example 4.13 Let X ~ Unif[A, B]. Since a uniform distribution is symmetric, the mean of X is at
the density curve’s point of symmetry, which is clearly the midpoint (A + B)/2. This can be verified
by integration:

\[ E(X) = \int_A^B x \cdot \frac{1}{B - A}\,dx = \frac{1}{B - A} \cdot \frac{x^2}{2}\Bigg|_A^B = \frac{1}{B - A} \cdot \frac{B^2 - A^2}{2} = \frac{A + B}{2} \]

The variance of X is then given by

\[ \begin{aligned} V(X) &= \int_A^B \left(x - \frac{A + B}{2}\right)^2 \cdot \frac{1}{B - A}\,dx \\ &= \frac{1}{B - A} \int_{-(B-A)/2}^{(B-A)/2} u^2\,du && \text{substitute } u = x - \frac{A + B}{2} \\ &= \frac{2}{B - A} \int_0^{(B-A)/2} u^2\,du && \text{symmetry} \\ &= \frac{2}{B - A} \cdot \frac{u^3}{3}\Bigg|_0^{(B-A)/2} = \frac{2}{B - A} \cdot \frac{(B - A)^3}{2^3 \cdot 3} = \frac{(B - A)^2}{12} \end{aligned} \]

The standard deviation of X is the square root of the variance: $\sigma = (B - A)/\sqrt{12}$. Notice that the
standard deviation of a Unif[A, B] distribution is proportional to the length of the interval, B - A,
which matches our intuitive notion that a larger standard deviation corresponds to greater “spread” in
a distribution. ■
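The closed forms in Example 4.13 can be compared against direct numerical integration. In the Python sketch below, the endpoints A = 25 and B = 35 are chosen purely for illustration, and the function names are assumptions introduced here.

```python
import math


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def uniform_mean(a, b):
    """Mean of Unif[a, b]: the midpoint of the interval."""
    return (a + b) / 2


def uniform_sd(a, b):
    """Standard deviation of Unif[a, b]: interval length over sqrt(12)."""
    return (b - a) / math.sqrt(12)


A, B = 25.0, 35.0                     # illustrative endpoints
density = 1 / (B - A)                 # constant height of the uniform pdf
mu = simpson(lambda x: x * density, A, B)
var = simpson(lambda x: (x - mu) ** 2 * density, A, B)

print(f"numeric mean = {mu:.4f}, formula = {uniform_mean(A, B):.4f}")
print(f"numeric sd   = {math.sqrt(var):.4f}, formula = {uniform_sd(A, B):.4f}")
```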


Section 3.3 presented several properties of expected value, variance, and standard deviation for 
discrete random variables. Those same properties hold for the continuous case; proofs of these results 
are obtained by replacing summation with integration in the proofs presented in Chapter 3. 


PROPOSITION Let X be a continuous rv with pdf f(x), mean $\mu$, and standard deviation $\sigma$. Then
the following properties hold.


1. (variance shortcut)

\[ V(X) = E(X^2) - \mu^2 = \int_{-\infty}^{\infty} x^2 \cdot f(x)\,dx - \left(\int_{-\infty}^{\infty} x \cdot f(x)\,dx\right)^2 \]

2. (linearity of expectation) For any functions $h_1(X)$ and $h_2(X)$ and any constants
$a_1$, $a_2$, and b,

\[ E[a_1 h_1(X) + a_2 h_2(X) + b] = a_1 E[h_1(X)] + a_2 E[h_2(X)] + b \]

3. (rescaling) For any constants a and b,

\[ E(aX + b) = aE(X) + b \qquad V(aX + b) = a^2 \sigma_X^2 \qquad \sigma_{aX+b} = |a| \sigma_X \]


Example 4.14 (Example 4.10 continued) For X = weekly gravel sales, we computed E(X) = 3/8.
Since

\[ E(X^2) = \int_{-\infty}^{\infty} x^2 \cdot f(x)\,dx = \int_0^1 x^2 \cdot \frac{3}{2}(1 - x^2)\,dx = \frac{3}{2}\int_0^1 (x^2 - x^4)\,dx = \frac{1}{5}, \]

$V(X) = 1/5 - (3/8)^2 = 19/320 = .059$ and $\sigma_X = \sqrt{.059} = .244$.


Suppose the amount of gravel actually received by customers in a week is $h(X) = X - .02X^2$; the
second term accounts for the small amount that is lost in transport. Then the average weekly amount
received by customers is

\[ E(X - .02X^2) = E(X) - .02E(X^2) = .375 - .02(.2) = .371 \text{ tons} \]

■
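The arithmetic in Example 4.14 can be reproduced exactly with rational numbers. The Python sketch below is illustrative: the function name `moment` is introduced here, and it relies on the fact that for this pdf the integral of xⁿ·(3/2)(1 - x²) over [0, 1] equals (3/2)[1/(n+1) - 1/(n+3)].

```python
from fractions import Fraction


def moment(n):
    """E(X^n) for the pdf f(x) = (3/2)(1 - x^2) on [0, 1], computed exactly."""
    return Fraction(3, 2) * (Fraction(1, n + 1) - Fraction(1, n + 3))


mean = moment(1)                                        # E(X) = 3/8
second = moment(2)                                      # E(X^2) = 1/5
variance = second - mean ** 2                           # variance shortcut
expected_received = mean - Fraction(2, 100) * second    # linearity of expectation

print(f"E(X) = {mean}, E(X^2) = {second}, V(X) = {variance}")
print(f"E(X - .02 X^2) = {float(expected_received):.3f} tons")
```

Working with `Fraction` avoids rounding, so the variance comes out as the exact ratio quoted in the example before it is rounded to .059.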


Example 4.15 When a dart is thrown at a circular target, consider the location of the landing point
relative to the bull’s eye. Let X be the angle in degrees measured from the horizontal, and assume that
X is uniformly distributed on [0, 360). By Example 4.13, E(X) = 180 and $\sigma_X = 360/\sqrt{12}$. Define Y to
be the angle measured in radians between $-\pi$ and $\pi$, so $Y = (2\pi/360)X - \pi$. Then, applying the
rescaling properties with $a = 2\pi/360$ and $b = -\pi$,

\[ E(Y) = aE(X) + b = \frac{2\pi}{360} \cdot 180 - \pi = 0 \]

and

\[ \sigma_Y = |a| \sigma_X = \frac{2\pi}{360} \cdot \frac{360}{\sqrt{12}} = \frac{2\pi}{\sqrt{12}} \]

■

As a special case of the result E(aX + b) = aE(X) + b, set a = 1 and $b = -\mu$, giving $E(X - \mu) =
E(X) - \mu = 0$. This can be interpreted as saying that the expected deviation from $\mu$ is 0;
$\int_{-\infty}^{\infty} (x - \mu) f(x)\,dx = 0$. The integral suggests a physical interpretation: With $(x - \mu)$ as the lever arm
and f(x) as the weight function, the total torque is 0. Using a seesaw as a model with weight
distributed in accord with f(x), the seesaw will balance at $\mu$.


Approximating the Mean Value and Standard Deviation 

Let X be a random variable with mean value $\mu$ and variance $\sigma^2$. Then we have already seen that the
new random variable Y = h(X) = aX + b, a linear function of X, has mean value $a\mu + b$ and variance
$a^2\sigma^2$. But what can be said about the mean and variance of Y if h(x) is a nonlinear function?


PROPOSITION (The Delta Method) Suppose h(x) is differentiable and that its derivative evaluated
at $\mu$ satisfies $h'(\mu) \ne 0$. Then if the variance of X is small, so that the distribution of X is
largely concentrated on an interval of values close to $\mu$, the mean value and
variance of Y = h(X) can be approximated as follows:

\[ E[h(X)] \approx h(\mu), \qquad V[h(X)] \approx [h'(\mu)]^2 \sigma^2 \]


The justification for these approximations is a first-order Taylor series expansion of h(X) about $\mu$; that
is, we approximate the function for values near $\mu$ by the tangent line to the function at the point
$(\mu, h(\mu))$:

\[ Y = h(X) \approx h(\mu) + h'(\mu)(X - \mu) \]

Taking the expected value of this gives $E[h(X)] \approx h(\mu)$. Since $h(\mu)$ and $h'(\mu)$ are numerical
constants, the variance of the linear approximation is $V[h(X)] \approx 0 + [h'(\mu)]^2 V(X - \mu) = [h'(\mu)]^2 \sigma^2$.


Example 4.16 A chemistry student determined the mass m and volume X of an aluminum chunk and
took the ratio to obtain the density Y = h(X) = m/X. The mass is measured much more accurately, so
for an approximate calculation it can be regarded as a constant. The derivative of h(X) is $-m/X^2$, so

\[ \sigma_Y^2 \approx \left[\frac{-m}{\mu_X^2}\right]^2 \sigma_X^2 \]

The standard deviation is then $\sigma_Y \approx [m/\mu_X^2]\,\sigma_X$. A particular aluminum chunk had measurements
m = 18.19 g and X = 6.6 cm³, which gives an estimated density Y = m/X = 18.19/6.6 = 2.76.
A rough value for the standard deviation of X is $\sigma_X = .3$ cm³. Our best guess for the mean of the
X distribution is the measured value, so $\mu_Y \approx h(\mu_X) = 18.19/6.6 = 2.76$, and the estimated standard
deviation for the estimated density is

\[ \sigma_Y \approx \frac{m}{\mu_X^2}\,\sigma_X = \frac{18.19}{6.6^2}(.3) = .125 \]

Compare the estimate of 2.76, standard deviation .125, with the official value 2.70 for the density of
aluminum. ■
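The delta method calculation in Example 4.16 is short enough to script. The Python sketch below is illustrative; the function name `delta_method` and its signature are assumptions introduced here, not notation from the text. It takes h, its derivative, and the mean and standard deviation of X, and returns the first-order approximations.

```python
def delta_method(h, h_prime, mu, sigma):
    """First-order approximations to the mean and SD of h(X).

    Returns (h(mu), |h'(mu)| * sigma), valid when X is concentrated near mu.
    """
    return h(mu), abs(h_prime(mu)) * sigma


m = 18.19        # mass in grams, treated as a constant
mu_x = 6.6       # measured volume in cm^3, used as the mean of X
sigma_x = 0.3    # rough standard deviation of the volume in cm^3

approx_mean, approx_sd = delta_method(
    lambda x: m / x,            # density h(x) = m / x
    lambda x: -m / x ** 2,      # derivative h'(x) = -m / x^2
    mu_x,
    sigma_x,
)

print(f"density is about {approx_mean:.2f} g/cm^3 with SD about {approx_sd:.3f}")
```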


Moment Generating Functions 
Moments and moment generating functions for discrete random variables were introduced in Sec- 
tion 3.4. These concepts carry over to the continuous case. 


DEFINITION The moment generating function (mgf) of a continuous random variable X is

\[ M_X(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx. \]

As in the discrete case, the moment generating function exists if $M_X(t)$ is
defined for an interval that includes zero as well as positive and negative
values of t.


Just as before, when t = 0 the value of the mgf is always 1:

\[ M_X(0) = E(e^{0X}) = \int_{-\infty}^{\infty} e^{0x} f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 1 \]


Example 4.17 At a store, the checkout time X in minutes has the pdf $f(x) = 2e^{-2x}$, x > 0. Then

\[ \begin{aligned} M_X(t) &= \int_{-\infty}^{\infty} e^{tx} f(x)\,dx = \int_0^{\infty} e^{tx} \cdot 2e^{-2x}\,dx = \int_0^{\infty} 2e^{-(2-t)x}\,dx \\ &= -\frac{2}{2 - t}\, e^{-(2-t)x}\Bigg|_0^{\infty} = \frac{2}{2 - t} - \frac{2}{2 - t} \lim_{x \to \infty} e^{-(2-t)x} \end{aligned} \]

The limit above exists (in fact, it equals zero) provided the coefficient on x is negative, i.e.,
-(2 - t) < 0. This is equivalent to t < 2. The mgf exists because it is defined for an interval of values
including 0 in its interior, specifically $(-\infty, 2)$. For t in that interval, the mgf of X is $M_X(t) =
2/(2 - t)$. Notice that $M_X(0) = 2/(2 - 0) = 1$, as it must be. ■
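As a sanity check on Example 4.17, the defining integral of the mgf can be approximated numerically for a few values of t below 2 and compared with 2/(2 - t). The Python sketch below is illustrative: the truncation point `upper`, the grid size, and the function names are assumptions made here, and the improper integral is cut off where the integrand is negligible for the chosen t values.

```python
import math


def simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def mgf_numeric(t, upper=60.0, n=20000):
    """Approximate M_X(t) for f(x) = 2 e^{-2x}, truncating the integral at `upper`."""
    return simpson(lambda x: math.exp(t * x) * 2 * math.exp(-2 * x), 0.0, upper, n)


def mgf_closed(t):
    """Closed form from Example 4.17, valid for t < 2."""
    return 2 / (2 - t)


checks = {t: (mgf_numeric(t), mgf_closed(t)) for t in (-1.0, 0.0, 0.5, 1.0)}
for t, (numeric, closed) in checks.items():
    print(f"t = {t:5.2f}: numeric = {numeric:.6f}, closed form = {closed:.6f}")
```

For t close to 2 the integrand decays slowly, so a larger truncation point would be needed; for t ≥ 2 the integral diverges and no truncation helps.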


The uniqueness property for the mgf of a discrete rv is equally valid in the continuous case. Two
distributions have the same pdf if and only if they have the same moment generating function,
assuming that the mgf exists. For example, if a random variable X is known to have mgf $M_X(t) =
2/(2 - t)$ for t < 2, then from Example 4.17 it must necessarily be the case that the pdf of X is
$f(x) = 2e^{-2x}$ for x > 0 and f(x) = 0 otherwise.

In the discrete case we had a theorem on how to get moments from the mgf, and this theorem
applies also in the continuous case: the rth moment of a continuous rv with mgf $M_X(t)$ is given by

\[ E(X^r) = M_X^{(r)}(0), \]

the rth derivative of the mgf with respect to t evaluated at t = 0, if the mgf exists.


Example 4.18 (Example 4.17 continued) The mgf of the rv X = checkout time at the store was
found to be $M_X(t) = 2/(2 - t) = 2(2 - t)^{-1}$ for t < 2. To find the mean and standard deviation, first
compute the derivatives:

\[ M_X'(t) = \frac{d}{dt}\left[2(2 - t)^{-1}\right] = -2(2 - t)^{-2}(-1) = \frac{2}{(2 - t)^2} \]

\[ M_X''(t) = \frac{d}{dt}\left[2(2 - t)^{-2}\right] = -4(2 - t)^{-3}(-1) = \frac{4}{(2 - t)^3} \]

Setting t to 0 in the first derivative gives the expected checkout time as
$E(X) = M_X^{(1)}(0) = M_X'(0) = .5$ min.
Setting t to 0 in the second derivative gives the second moment
$E(X^2) = M_X^{(2)}(0) = M_X''(0) = .5$,
from which $V(X) = E(X^2) - [E(X)]^2 = .5 - .5^2 = .25$ and $\sigma = \sqrt{.25} = .5$ min. ■
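The moments in Example 4.18 can be recovered without symbolic differentiation by applying central finite differences to the mgf at t = 0. The Python sketch below is illustrative; the step size `h` and the names are assumptions introduced here, and the results match the exact derivatives only up to discretization error.

```python
import math


def mgf(t):
    """mgf of the checkout time from Example 4.17, valid for t < 2."""
    return 2 / (2 - t)


h = 1e-3  # finite-difference step, small relative to the distance from t = 2

# Central-difference approximations to M'(0) and M''(0).
first_moment = (mgf(h) - mgf(-h)) / (2 * h)
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2

variance = second_moment - first_moment ** 2
sd = math.sqrt(variance)

print(f"E(X) = {first_moment:.4f}, E(X^2) = {second_moment:.4f}, SD = {sd:.4f}")
```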


As in the discrete case, if X has the mgf $M_X(t)$ then the mgf of the linear function Y = aX + b is
$M_Y(t) = e^{bt} M_X(at)$.


Example 4.19 Let X have a uniform distribution on the interval [A, B], so its pdf is f(x) = 1/(B - A),
A ≤ x ≤ B; f(x) = 0 otherwise. As verified in Exercise 32, the moment generating function of X is

\[ M_X(t) = \begin{cases} \dfrac{e^{Bt} - e^{At}}{(B - A)t} & t \ne 0 \\[4pt] 1 & t = 0 \end{cases} \]

In particular, consider the situation in Example 4.15. Let X, the angle measured in degrees, be uniform
on [0, 360], so A = 0 and B = 360. Then

\[ M_X(t) = \frac{e^{360t} - 1}{360t}, \quad t \ne 0, \qquad M_X(0) = 1 \]

Now let $Y = (2\pi/360)X - \pi$, so Y is the angle measured in radians and Y is between $-\pi$ and $\pi$. Using
the mgf rule for linear transformations with $a = 2\pi/360$ and $b = -\pi$,

\[ \begin{aligned} M_Y(t) &= e^{bt} M_X(at) = e^{-\pi t} M_X\left(\frac{2\pi}{360}t\right) \\ &= e^{-\pi t} \cdot \frac{e^{360(2\pi/360)t} - 1}{360\left(\frac{2\pi}{360}t\right)} \\ &= \frac{e^{\pi t} - e^{-\pi t}}{2\pi t}, \quad t \ne 0, \qquad M_Y(0) = 1 \end{aligned} \]

This matches the general form of the moment generating function for a uniform random variable with
$A = -\pi$ and $B = \pi$. Thus, by the uniqueness principle, $Y \sim \text{Unif}[-\pi, \pi]$. ■
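The algebra in Example 4.19 can be spot-checked numerically: for any t, the transformed mgf $e^{bt}M_X(at)$ should agree with the Unif[-π, π] mgf. The Python sketch below is illustrative; the function names and the particular t values are assumptions introduced here.

```python
import math


def uniform_mgf(t, a, b):
    """mgf of Unif[a, b]; equals 1 at t = 0 by definition."""
    if t == 0:
        return 1.0
    return (math.exp(b * t) - math.exp(a * t)) / ((b - a) * t)


def mgf_of_radians(t):
    """mgf of Y = (2*pi/360) X - pi, built from the mgf of X ~ Unif[0, 360]."""
    a, b = 2 * math.pi / 360, -math.pi
    return math.exp(b * t) * uniform_mgf(a * t, 0.0, 360.0)


pairs = [
    (mgf_of_radians(t), uniform_mgf(t, -math.pi, math.pi))
    for t in (-1.0, -0.3, 0.0, 0.5, 2.0)
]
for transformed, direct in pairs:
    print(f"{transformed:.10f}  {direct:.10f}")
```

Agreement at a handful of points does not prove the identity, but it is a cheap guard against sign or scaling slips in the derivation.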


Exercises: Section 4.2 (18-38) 


18. Reconsider the distribution of checkout duration X described in Exercises 1 and 11.
Compute the following:


a. E(X)

b. V(X) and $\sigma_X$

c. If the borrower is charged an amount $h(X) = X^2$ when checkout duration is X,
compute the expected charge E[h(X)].


19. Recall the distribution of hazardous flood rate used in Example 4.5.


a. Obtain the mean and standard deviation of this distribution.

b. What is the probability that the flood rate is within 1 standard deviation of the
mean value?


20. The article “Forecasting Postflight Hip Fracture Probability Using Probabilistic
Modeling” (J. Biomech. Engr. 2019) examines the risk of bone breaks for astronauts
returning from space, who typically lose density during missions. One quantity
the article’s authors model is the midpoint fracture risk index (mFRI), the ratio of
applied load to bone strength at which the chance of a fracture is 50-50. The article
suggests a uniform distribution on [0.55, 1.45] to model this unitless index value.


a. Calculate the mean and standard deviation of mFRI using the specified model.

b. Determine the cdf of mFRI.

c. What is the probability that mFRI is less than 1? Between 0.75 and 1.25?

d. What is the probability that mFRI is within one standard deviation of its
expected value? Within two standard deviations?

21. For the distribution of Exercise 14,


a. Compute E(X) and $\sigma_X$.

b. What is the probability that X is more than 1 standard deviation from its mean
value?


22. Consider the pdf of X = grade point average given in Exercise 6.


a. Obtain and graph the cdf of X.
b. From the graph of f(x), what is $\tilde{\mu}$?
c. Compute E(X) and V(X).


23. Let X have a uniform distribution on the interval [A, B].

a. Obtain an expression for the (100p)th percentile.

b. Obtain an expression for the median, $\tilde{\mu}$. How does this compare to the mean $\mu$,
and why does that make sense for this distribution?

c. For n a positive integer, compute $E(X^n)$.

24. Consider the pdf for total waiting time Y for two buses

\[ f(y) = \begin{cases} .04y & 0 \le y < 5 \\ .4 - .04y & 5 \le y \le 10 \end{cases} \]

introduced in Exercise 8.
a. Compute and sketch the cdf of Y. [Hint: Consider separately 0 ≤ y < 5 and
5 ≤ y ≤ 10 in computing F(y). A graph of the pdf should be helpful.]

b. Obtain an expression for the (100p)th percentile. [Hint: Consider separately
0 < p < .5 and .5 ≤ p < 1.]

c. Compute E(Y) and V(Y). How do these compare with the expected waiting time
and variance for a single bus when the time is uniformly distributed on [0, 5]?

25. An ecologist wishes to mark off a circular sampling region having radius 10 m.
However, the radius of the resulting region is actually a random variable R with pdf

\[ f(r) = \frac{3}{4}\left[1 - (10 - r)^2\right], \quad 9 \le r \le 11 \]

What is the expected area of the resulting circular region?


26. The weekly demand for propane gas (in 1000s of gallons) from a particular facility
is an rv X with pdf


a. Compute the cdf of X.

b. Obtain an expression for the (100p)th percentile. What is the value of $\tilde{\mu}$?

c. Compute E(X). How do the mean and median of this distribution compare?

d. Compute V(X) and $\sigma_X$.

e. If 1.5 thousand gallons are in stock at the beginning of the week and no new
supply is due in during the week, how much of the 1.5 thousand gallons is
expected to be left at the end of the week? [Hint: Let h(x) = amount left
when demand is x.]


27. If the temperature at which a compound melts is a random variable with mean
value 120 °C and standard deviation 2 °C, what are the mean temperature and
standard deviation measured in °F? [Hint: °F = 1.8 °C + 32.]


28. Let X have the Pareto pdf introduced in Exercise 10.

\[ f(x; k, \theta) = \frac{k \cdot \theta^k}{x^{k+1}}, \quad x \ge \theta \]

a. If k > 1, compute E(X).

b. What can you say about E(X) if k = 1?

c. If k > 2, show that $V(X) = k\theta^2 (k - 1)^{-2} (k - 2)^{-1}$.

d. If k = 2, what can you say about V(X)?

e. What conditions on k are necessary to ensure that $E(X^n)$ is finite?


29. The time (min) between successive visits to a particular website has pdf $f(x) = 4e^{-4x}$,
x ≥ 0; f(x) = 0 otherwise. Use integration by parts to obtain E(X) and V(X).

30. Suppose that the pdf of X is

\[ f(x) = .5 - \frac{x}{8}, \quad 0 \le x \le 4 \]

a. Show that E(X) = 4/3 and V(X) = 8/9.
b. The coefficient of skewness is defined as $E[(X - \mu)^3]/\sigma^3$. Show that its value for
the given pdf is .566. What would the skewness be for a perfectly symmetric
pdf? Explain your reasoning.


31. a. If the voltage v across a medium is fixed but current I is random, then resistance
will also be a random variable related to I by R = v/I. If $\mu_I = 20$ and $\sigma_I = .5$, use
the delta method to calculate approximations to $\mu_R$ and $\sigma_R$.

b. Let R have the distribution in Exercise 25, whose mean and variance are 10 and
1/5, respectively. Let $h(R) = \pi R^2$, the area of the ecologist’s sampling region.
How does E[h(R)] from Exercise 25 compare to the delta method approximation h(10)?

c. The variance of the region’s area is $V[h(R)] = 14008\pi^2/175$. Compute the
delta method approximation to V[h(R)]. How good is the approximation?
32. Let X have a uniform distribution on the interval [A, B], so its pdf is f(x) = 1/(B - A),
A ≤ x ≤ B, f(x) = 0 otherwise. Show that the moment generating function of X is

\[ M_X(t) = \frac{e^{Bt} - e^{At}}{(B - A)t}, \quad t \ne 0 \]

33. Let X ~ Unif[0, 1]. Find a linear function Y = g(X) such that the interval [0, 1] is
transformed into [-5, 5]. Use the relationship for linear functions $M_{aX+b}(t) = e^{bt} M_X(at)$ to
obtain the mgf of Y from the mgf of X. Compare your answer with the result of Exercise
32, and use this to obtain the pdf of Y.


34. If the pdf of a measurement error X is $f(x) = .5e^{-|x|}$, $-\infty < x < \infty$, show that

\[ M_X(t) = \frac{1}{1 - t^2} \quad \text{for } |t| < 1. \]

35. Consider the rv X = hazardous flood rate in Example 4.5.


a. Find the moment generating function and use it to find the mean and variance.

b. Now consider a random variable whose pdf is

\[ f(x) = .04e^{-.04x}, \quad x \ge 0 \]

Find the moment generating function and use it to find the mean and variance.
Compare with (a), and explain the similarities and differences.

c. Let Y = X - 10 and use the relationship for linear functions $M_{aX+b}(t) = e^{bt} M_X(at)$
to obtain the mgf of Y from (a). Compare with the result of (b) and explain.


36. Define $R_X(t) = \ln[M_X(t)]$. It was shown in Chapter 3 that $R_X'(0) = E(X)$ and
$R_X''(0) = V(X)$.

a. Determine $M_X(t)$ for the pdf in Exercise 29, and use this mgf to obtain E(X) and
V(X). How does this compare, in terms of difficulty, with the integration by parts
required in that exercise?

b. Determine $R_X(t)$ for this same distribution, and use $R_X(t)$ to obtain E(X) and
V(X). How does the computational effort here compare with that of (a)?


37. Let X be a nonnegative, continuous rv with pdf f(x) and cdf F(x).
a. Show that, for any constant t > 0,

\[ \int_t^{\infty} x \cdot f(x)\,dx \ge t \cdot P(X > t) = t \cdot [1 - F(t)] \]

b. Assume the mean of X is finite (i.e., the integral defining $\mu$ converges). Use part
(a) to show that

\[ \lim_{t \to \infty} t \cdot [1 - F(t)] = 0 \]

38. Let X be a nonnegative, continuous rv with cdf F(x).
a. Assuming the mean $\mu$ of X is finite, show that

\[ \mu = \int_0^{\infty} [1 - F(x)]\,dx \]

[Hint: Apply integration by parts to the integral above, and use the result of the
previous exercise.]

b. A similar argument can be used to show that the kth moment of X is given by

\[ E(X^k) = k \int_0^{\infty} x^{k-1} [1 - F(x)]\,dx \]

and that $E(X^k)$ exists iff $t^k [1 - F(t)] \to 0$ as $t \to \infty$. (This was the topic of a 2012
article in The American Statistician.)

c. Suppose the lifetime X, in weeks, of a low-grade transistor under continuous
use has cdf $F(x) = 1 - (x + 1)^{-3}$ for x > 0. Without finding the pdf of X,
determine its mean and its standard deviation.
4.3 The Normal Distribution


The normal distribution is the most important one in all of probability and statistics. Many numerical 
populations have distributions that can be fit very closely by an appropriate normal curve. Examples 
include heights, weights, and other physical characteristics, measurement errors in scientific exper- 
iments, measurements on fossils, reaction times in psychological experiments, scores on various tests, 
and numerous economic measures and indicators. Even when the underlying distribution is discrete, 
the normal curve often gives an excellent approximation. In addition, even when individual variables 
themselves are not normally distributed, sums and averages of the variables will under suitable 
conditions have approximately a normal distribution; this is the content of the Central Limit Theo- 
rem discussed in Chapter 6. 


DEFINITION A continuous rv X is said to have a normal distribution with parameters $\mu$ and
$\sigma$, where $-\infty < \mu < \infty$ and $\sigma > 0$, if the pdf of X is

\[ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty \tag{4.3} \]

The statement that X is normally distributed with parameters $\mu$ and $\sigma$ will be
denoted by $X \sim N(\mu, \sigma)$.


Figure 4.13 presents graphs of $f(x; \mu, \sigma)$ for several different $(\mu, \sigma)$ pairs. Each resulting density curve
is symmetric about $\mu$ and bell-shaped, so the center of the bell (point of symmetry) is both the mean
of the distribution and the median. The value of $\sigma$ is the distance from $\mu$ to the inflection points of the
curve (the points at which the curve changes between turning downward to turning upward). Large
values of $\sigma$ yield density curves that are quite spread out about $\mu$, whereas small values of $\sigma$ yield
density curves with a high peak above $\mu$ and most of the area under the density curve quite close to $\mu$.
Thus a large $\sigma$ implies that a value of X far from $\mu$ may well be observed, whereas such a value is
quite unlikely when $\sigma$ is small.


Figure 4.13 Normal density curves


Clearly $f(x; \mu, \sigma) \ge 0$, but a clever calculus argument is required to prove that $\int_{-\infty}^{\infty} f(x; \mu, \sigma)\,dx = 1$
(see Exercise 66). It can be shown using calculus (Exercise 67) or moment generating functions
(Exercise 68) that $E(X) = \mu$ and $V(X) = \sigma^2$, so the parameters $\mu$ and $\sigma$ are the mean and the standard
deviation, respectively, of X.
The Standard Normal Distribution
To compute $P(a \le X \le b)$ when $X \sim N(\mu, \sigma)$, we must evaluate

\[ \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx \tag{4.4} \]

None of the standard integration techniques can be used to evaluate (4.4), and there is no closed-form
expression for (4.4). Table 4.2 at the end of this section provides the code for performing such normal
calculations in R. For the purpose of hand calculation, we now introduce a special normal distribution.


DEFINITION The normal distribution with parameter values $\mu = 0$ and $\sigma = 1$ is called the
standard normal distribution. A random variable that has a standard normal
distribution is called a standard normal random variable and will be denoted
by Z. The pdf of Z, denoted $\phi(z)$, is

\[ \phi(z) = f(z; 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty \]

The cdf of Z is $P(Z \le z) = \int_{-\infty}^{z} \phi(y)\,dy = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy$, which we will
denote by $\Phi(z)$.


The standard normal distribution does not frequently serve as a model for a naturally arising
population, since few variables have mean 0 and standard deviation 1. Instead, it is a reference
distribution from which information about other normal distributions can be obtained. Appendix Table A.3
gives values of $\Phi(z)$ for z = -3.49, -3.48, ..., 3.48, 3.49 and is referred to as the standard normal
table or z table. Figure 4.14 illustrates the type of cumulative area (probability) tabulated in
Table A.3. From this table, various other probabilities involving Z can be calculated.


Figure 4.14 Standard normal cumulative areas tabulated in Appendix Table A.3 (standard normal (z) curve; shaded area = $\Phi(z)$ to the left of z)


Example 4.20 Here we demonstrate how the z table is used to calculate various probabilities
involving a standard normal rv.

a. $P(Z \le 1.25) = \Phi(1.25)$, a probability that is tabulated in Table A.3 at the intersection of the row
marked 1.2 and the column marked .05. The number there is .8944, so $P(Z \le 1.25) = .8944$. See
Figure 4.15a. In R, we may type pnorm(1.25,0,1) or just pnorm(1.25).

b. $P(Z > 1.25) = 1 - P(Z \le 1.25) = 1 - \Phi(1.25)$, the area under the standard normal curve to the
right of 1.25 (an upper-tail area). Since $\Phi(1.25) = .8944$, it follows that P(Z > 1.25) = .1056.
Since Z is a continuous rv, $P(Z \ge 1.25)$ also equals .1056. See Figure 4.15b.

Figure 4.15 Normal curve areas (probabilities) for Example 4.20

c. $P(Z \le -1.25) = \Phi(-1.25)$, a lower-tail area. Directly from the z table, $\Phi(-1.25) = .1056$. By
symmetry of the normal curve, this is identical to the probability in (b).

d. $P(-.38 \le Z \le 1.25)$ is the area under the standard normal curve above the interval [-.38, 1.25].
From Section 4.1, if Z is a continuous rv with cdf F(z), then $P(a \le Z \le b) = F(b) - F(a)$. This
gives $P(-.38 \le Z \le 1.25) = \Phi(1.25) - \Phi(-.38) = .8944 - .3520 = .5424$ (see Figure 4.16).
To evaluate this probability in R, type pnorm(1.25,0,1)-pnorm(-.38,0,1) or just
pnorm(1.25)-pnorm(-.38).


Figure 4.16 $P(-.38 \le Z \le 1.25)$ as the difference between two cumulative areas ■
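For readers working in Python rather than R, the standard library's `statistics.NormalDist` offers a `cdf` method that plays the role of pnorm. The sketch below reproduces the four probabilities of Example 4.20; the variable names are illustrative, and the results agree with the z table only to the table's four-decimal precision.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_lower = Z.cdf(1.25)                    # (a) P(Z <= 1.25)
p_upper = 1 - Z.cdf(1.25)                # (b) P(Z > 1.25)
p_lower_tail = Z.cdf(-1.25)              # (c) P(Z <= -1.25), equal to (b) by symmetry
p_between = Z.cdf(1.25) - Z.cdf(-0.38)   # (d) P(-.38 <= Z <= 1.25)

for label, value in [("a", p_lower), ("b", p_upper),
                     ("c", p_lower_tail), ("d", p_between)]:
    print(f"({label}) {value:.4f}")
```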


From Section 4.1, we have that the (100p)th percentile of the standard normal distribution, for any
p between 0 and 1, is the solution to the equation $\Phi(z) = p$. So, we may write the (100p)th percentile
of the standard normal distribution as $\eta_p = \Phi^{-1}(p)$. Software or the z table can be used to obtain this
percentile.


Example 4.21 The 99th percentile of the standard normal distribution is that value on the horizontal 
axis such that the area under the curve to the left of the value is .9900. Appendix Table A.3 gives for 
fixed z the area under the standard normal curve to the left of z, whereas here we have the area and 
want the value of z. This is the “inverse” problem to P(Z < z) = ? so the table is used in an inverse 
fashion: Find in the middle of the table .9900; the row and column in which it lies identify the 99th 
z percentile. Here .9901 lies in the row marked 2.3 and column marked .03, so the 99th percentile is 
(approximately) z = 2.33 (see Figure 4.17). By symmetry, the first percentile is the negative of the
99th percentile, so it equals −2.33 (1% lies below the first and above the 99th). See Figure 4.18.

Figure 4.17 Finding the 99th percentile (z curve; shaded area = .9900; 99th percentile = 2.33)

Figure 4.18 The relationship between the 1st and 99th percentiles (z curve; shaded area = .01; −2.33 = 1st percentile, 2.33 = 99th percentile)

To determine the 99th percentile of the standard normal distribution in R, use the command
qnorm(.99,0,1) or just qnorm(.99). ■


In general, the (100p)th percentile is identified by the row and column of Appendix Table A.3 in 
which the entry p is found (e.g., the 67th percentile is obtained by finding .6700 in the body of the 
table, which gives z = .44). If p does not appear, the number closest to it is often used, although linear 
interpolation gives a more accurate answer. For example, to find the 95th percentile, we look for 
.9500 inside the table. Although .9500 does not appear, both .9495 and .9505 do, corresponding to 
z = 1.64 and 1.65, respectively. Since .9500 is halfway between the two probabilities that do appear, 
we will use 1.645 as the 95th percentile and —1.645 as the 5th percentile. 
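The table lookups described above can be compared against software. This is a small sketch using qnorm, the standard normal quantile function; the variable names are illustrative only.

```r
# Standard normal percentiles via the quantile function qnorm()
z99 <- qnorm(0.99)   # 99th percentile (the table lookup gave about 2.33)
z01 <- qnorm(0.01)   # 1st percentile, the negative of the 99th by symmetry
z95 <- qnorm(0.95)   # 95th percentile (interpolation in the table gave 1.645)
z67 <- qnorm(0.67)   # 67th percentile (the table lookup gave .44)
round(c(z99, z01, z95, z67), 3)
pnorm(z95)           # applying the cdf to a percentile recovers p = .95
```

Because qnorm inverts Φ numerically, no interpolation step is needed when p falls between two table entries.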


z_α Notation
In statistical inference, we will need the values on the measurement axis that capture certain small tail
areas under the standard normal curve.


NOTATION z_α will denote the value on the measurement axis for which α of the area under
the z curve lies to the right of z_α. That is, z_α = Φ⁻¹(1 − α) (see Figure 4.19).

Figure 4.19 z_α notation illustrated (z curve; shaded area = P(Z ≥ z_α) = α)


For example, z_.10 captures upper-tail area .10 and z_.01 captures upper-tail area .01.

Since α of the area under the standard normal curve lies to the right of z_α, 1 − α of the area lies to
the left of z_α. Thus z_α is the 100(1 − α)th percentile of the standard normal distribution. By
symmetry the area under the standard normal curve to the left of −z_α is also α. The z_α's are usually
referred to as z critical values. Table 4.1 lists the most useful standard normal percentiles and z_α
values.


Table 4.1 Standard normal percentiles and critical values


Percentile                       90     95     97.5   99     99.5   99.9   99.95
α (tail area)                    .1     .05    .025   .01    .005   .001   .0005
z_α = 100(1 − α)th percentile    1.28   1.645  1.96   2.33   2.58   3.08   3.27
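Critical values like those in Table 4.1 can be generated in R from the relationship z_α = Φ⁻¹(1 − α). The sketch below is illustrative; the vector and column names are our own.

```r
# z critical values: z_alpha = qnorm(1 - alpha)
alpha <- c(.10, .05, .025, .01, .005, .001, .0005)
z_alpha <- qnorm(1 - alpha)
data.frame(percentile = 100 * (1 - alpha),
           alpha = alpha,
           z_alpha = round(z_alpha, 3))

# Equivalent request for an upper-tail area
z_upper <- qnorm(alpha, lower.tail = FALSE)
```

Software values need not agree with table-derived entries to the last digit, particularly far out in the tails; for α = .001 and α = .0005, qnorm returns about 3.09 and 3.29.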


Example 4.22 The 100(1 − .05)th = 95th percentile of the standard normal distribution is z_.05, so
z_.05 = 1.645. The area under the standard normal curve to the left of −z_.05 is also .05. See Figure 4.20.

Figure 4.20 Finding z_.05 (z curve; each shaded tail area = .05; −1.645 = −z_.05 and z_.05 = 95th percentile = 1.645) ■


Nonstandardized Normal Distributions

When X ~ N(μ, σ), probabilities involving X may be computed by “standardizing.” A standardized
variable has the form (X − μ)/σ. Subtracting μ shifts the mean from μ to zero; dividing by σ scales
the variable so that the standard deviation is 1 rather than σ.

Standardizing amounts to calculating a distance from the mean value and then re-expressing the
distance as some number of standard deviations. For example, if μ = 100 and σ = 15, then x = 130
corresponds to z = (130 − 100)/15 = 30/15 = 2.00. Thus 130 is 2 standard deviations above (i.e., to
the right of) the mean value. Similarly, standardizing 85 gives (85 − 100)/15 = −1.00, so 85 is 1
standard deviation below the mean. According to the next proposition, the z table applies to any
normal distribution provided that we think in terms of number of standard deviations away from the
mean value.


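PROPOSITION If X ~ N(μ, σ), then the “standardized” rv Z defined by

\[ Z = \frac{X - \mu}{\sigma} \]

has a standard normal distribution. Thus

\[ P(a \le X \le b) = P\left(\frac{a-\mu}{\sigma} \le Z \le \frac{b-\mu}{\sigma}\right) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right), \]

\[ P(X \le a) = \Phi\left(\frac{a-\mu}{\sigma}\right), \qquad P(X \ge b) = 1 - \Phi\left(\frac{b-\mu}{\sigma}\right), \]

and the (100p)th percentile of the N(μ, σ) distribution is given by

\[ \eta_p = \mu + \Phi^{-1}(p)\cdot\sigma. \]

Conversely, if Z ~ N(0, 1) and μ and σ are constants (with σ > 0), then the
“unstandardized” rv X = μ + σZ has a normal distribution with mean μ and
standard deviation σ.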
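The proposition can be illustrated numerically: standardizing by hand and passing μ and σ to R's functions should give the same answers. The sketch below reuses μ = 100 and σ = 15 from the discussion above; the endpoints 85 and 130 and the choice p = .9 are ours, for illustration.

```r
# Standardizing by hand versus letting pnorm()/qnorm() take mu and sigma
mu <- 100; sigma <- 15    # illustrative parameter values
a <- 85; b <- 130         # 1 SD below and 2 SDs above the mean

# P(a <= X <= b) computed on the z scale
p_std <- pnorm((b - mu) / sigma) - pnorm((a - mu) / sigma)
# The same probability, with R doing the standardizing
p_direct <- pnorm(b, mu, sigma) - pnorm(a, mu, sigma)

# (100p)th percentile: eta_p = mu + qnorm(p) * sigma
p <- 0.9
eta_std <- mu + qnorm(p) * sigma
eta_direct <- qnorm(p, mu, sigma)

c(p_std, p_direct)        # both equal Phi(2) - Phi(-1)
c(eta_std, eta_direct)
```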


Proof Let X ~ N(μ, σ). Then the cdf of Z = (X − μ)/σ is given by

\[ F_Z(z) = P(Z \le z) = P\left(\frac{X-\mu}{\sigma} \le z\right) = P(X \le \mu + z\sigma) = \int_{-\infty}^{\mu+z\sigma} f(x;\mu,\sigma)\,dx = \int_{-\infty}^{\mu+z\sigma} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}\,dx. \]

Now make the substitution u = (x − μ)/σ. The new limits of integration become −∞ to z, and the
differential dx is replaced by σ du, resulting in

\[ F_Z(z) = \int_{-\infty}^{z} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-u^2/2}\,\sigma\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du = \Phi(z). \]

Thus, the cdf of (X − μ)/σ is the standard normal cdf, so (X − μ)/σ ~ N(0, 1).
The probability formulas in the statement of the proposition follow directly from this main result,
as does the formula for the (100p)th percentile:

\[ p = P(X \le \eta_p) = P\left(\frac{X-\mu}{\sigma} \le \frac{\eta_p-\mu}{\sigma}\right) = \Phi\left(\frac{\eta_p-\mu}{\sigma}\right) \;\Rightarrow\; \frac{\eta_p-\mu}{\sigma} = \Phi^{-1}(p) \;\Rightarrow\; \eta_p = \mu + \Phi^{-1}(p)\cdot\sigma. \]

The converse statement Z ~ N(0, 1) ⇒ μ + σZ ~ N(μ, σ) is derived similarly. ■


The key idea of this proposition is that by standardizing, any probability involving X can be 
expressed as a probability involving a standard normal rv Z, so that the z table can be used. This is 
illustrated in Figure 4.21. 


Figure 4.21 Equality of nonstandard and standard normal curve areas (panels: N(μ, σ) curve with μ and x marked; N(0, 1) curve with 0 and (x − μ)/σ marked)


Software eliminates the need for standardizing X, although the standard normal distribution is still 
important in its own right. Table 4.2 at the end of this section details the relevant R commands, which 
are also illustrated in the following examples. 


Example 4.23 The authors of the article “Assessing the Importance of Surgeon Hand Anthropometry
on the Design of Medical Devices” (J. Med. Devices 2017) investigate whether standard
surgical instruments, such as some surgical staplers, might be too large for some physicians’ hands.
According to their research, the proximal grip distance (a measure of one’s index finger) for male
surgeons follows a normal distribution with mean 7.20 cm and standard deviation 0.51 cm. To use
one particular stapler, the surgeon’s proximal grip distance must be at least 6.83 cm. What is the
probability a male surgeon’s hand is large enough to use this stapler? If we let X denote the proximal
grip distance of a randomly selected male surgeon, then standardizing gives X ≥ 6.83 if and only if

\[ \frac{X - 7.20}{0.51} \ge \frac{6.83 - 7.20}{0.51} \]

Thus

\[ P(X \ge 6.83) = P\left(\frac{X - 7.20}{0.51} \ge \frac{6.83 - 7.20}{0.51}\right) = P(Z \ge -0.73) = 1 - P(Z < -0.73) = 1 - \Phi(-0.73) = 1 - .2327 = .7673 \]

This is illustrated in Figure 4.22. In other words, nearly a quarter of male surgeons would not be able
to use this particular surgical stapler, because their hands are too small (or the stapler is too large,
depending on your perspective).

Figure 4.22 Normal curves for Example 4.23 (normal curve with μ = 7.20, σ = .51 and shaded area P(X ≥ 6.83); z curve with the corresponding area to the right of −0.73)

As you might imagine, the situation is worse for female surgeons, whose proximal grip distance
distribution can be modeled as N(6.58, 0.50). Denoting the appropriate rv by Y, the probability a
female surgeon cannot use this stapler is

\[ P(Y < 6.83) = P\left(\frac{Y - 6.58}{0.50} < \frac{6.83 - 6.58}{0.50}\right) = P(Z < 0.5) = \Phi(0.5) = .6915 \]

Fortunately, as noted by the authors of the article, another brand of surgical stapler exists for which
the required proximal grip distance is only 5.13 cm, meaning that practically all surgeons of either sex
can comfortably use this other brand of stapler. ■
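The probabilities in Example 4.23 can also be obtained without standardizing by hand. A brief sketch, with illustrative variable names:

```r
# Example 4.23, letting pnorm() handle the mean and standard deviation
p_male   <- 1 - pnorm(6.83, mean = 7.20, sd = 0.51)  # P(X >= 6.83), male surgeons
p_female <- pnorm(6.83, mean = 6.58, sd = 0.50)      # P(Y < 6.83), female surgeons
round(c(male_can_use = p_male, female_cannot_use = p_female), 4)
```

The male figure differs slightly from .7673 in the third decimal place, because the table-based calculation rounds z to −0.73 before looking up Φ.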


Example 4.24 The amount of distilled water dispensed by a machine is normally distributed with
mean value 64 oz and standard deviation .78 oz. What container size c will ensure that overflow
occurs only .5% of the time? If X denotes the amount dispensed, the desired condition is that
P(X > c) = .005, or, equivalently, that P(X ≤ c) = .995. Thus c is the 99.5th percentile of the normal
distribution with μ = 64 and σ = .78. The 99.5th percentile of the standard normal distribution is
Φ⁻¹(.995) = 2.58, so

\[ c = \eta_{.995} = 64 + (2.58)(.78) = 64 + 2.0 = 66.0 \text{ oz} \]

This is illustrated in Figure 4.23.

Figure 4.23 Distribution of amount dispensed for Example 4.24 (normal curve with μ = 64; shaded area = .995; c = 99.5th percentile = 66.0)

The R command to calculate this percentile is qnorm(.995,64,.78). ■


Example 4.25 The return on a diversified investment portfolio is normally distributed. What is the
probability that the return is within 1 standard deviation of its mean value? This question can be
answered without knowing either μ or σ, as long as the distribution is known to be normal. That is,
the answer is the same for any normal distribution:

\[ P(X \text{ is within one standard deviation of its mean}) = P(\mu - \sigma \le X \le \mu + \sigma) = P\left(\frac{\mu - \sigma - \mu}{\sigma} \le Z \le \frac{\mu + \sigma - \mu}{\sigma}\right) = P(-1 \le Z \le 1) = \Phi(1) - \Phi(-1) = .6826 \]

The probability that X is within 2 standard deviations of the mean is P(−2 ≤ Z ≤ 2) = .9544 and
the probability that X is within 3 standard deviations of the mean is P(−3 ≤ Z ≤ 3) = .9973. ■


The results of Example 4.25 are often reported in percentage form and referred to as the empirical 
rule (because empirical evidence has shown that histograms of real data can very frequently be 
approximated by normal curves). 


EMPIRICAL RULE If the population distribution of a variable is (approximately) normal, then 
1. Roughly 68% of the values are within 1 SD of the mean. 
2. Roughly 95% of the values are within 2 SDs of the mean. 
3. Roughly 99.7% of the values are within 3 SDs of the mean. 
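The three percentages in the empirical rule come from standard normal areas, which can be recomputed in one line of R. A short sketch (the name within_k_sd is ours):

```r
# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
```

Software gives .6827 and .9545 for k = 1 and 2; table-based values such as .6826 and .9544 differ in the fourth decimal place because the table entries are rounded before subtracting.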


It is indeed unusual to observe a value from a normal population that is much farther than 2 standard
deviations from μ. These results will be important in the development of hypothesis-testing procedures
in later chapters.


The Normal MGF 
The moment generating function provides a straightforward way to establish several important results 
concerning normal distributions. 


PROPOSITION The moment generating function of a normally distributed random variable X is

\[ M_X(t) = e^{\mu t + \sigma^2 t^2/2} \]

Proof Consider first the special case of a standard normal rv Z. Then

\[ M_Z(t) = E\left(e^{tZ}\right) = \int_{-\infty}^{\infty} e^{tz}\, \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z^2 - 2tz)/2}\,dz \]

Completing the square in the exponent, we have

\[ M_Z(t) = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z^2 - 2tz + t^2)/2}\,dz = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z-t)^2/2}\,dz \]

The last integral is the area under a normal density with mean t and standard deviation 1, so the value
of the integral is 1. Therefore, M_Z(t) = e^{t²/2}.

Now let X be any normal rv with mean μ and standard deviation σ. Then, by the proposition earlier
in this section, (X − μ)/σ = Z, where Z is standard normal. Rewrite this relationship as X = μ + σZ,
and use the property M_{aY+b}(t) = e^{bt} M_Y(at):

\[ M_X(t) = M_{\mu + \sigma Z}(t) = e^{\mu t} M_Z(\sigma t) = e^{\mu t} e^{\sigma^2 t^2/2} = e^{\mu t + \sigma^2 t^2/2} \]

■

The normal mgf can be used to establish that μ and σ are indeed the mean and standard deviation of
X, as claimed earlier (Exercise 68). Also, by the mgf uniqueness property, any rv X whose moment
generating function has the form specified above is necessarily normally distributed. For example, if it
is known that the mgf of X is M_X(t) = e^{8t²}, then X must be a normal rv with mean μ = 0 and standard
deviation σ = 4, since the N(0, 4) distribution has e^{8t²} as its mgf (here σ²t²/2 = 16t²/2 = 8t²).
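As a numerical sanity check (not a proof), the mgf formula can be compared with a direct evaluation of E(e^{tX}) by numerical integration. The sketch below uses R's integrate function; the helper names mgf_numeric and mgf_formula and the test point t = 0.1 are our own choices.

```r
# Numerical check of M_X(t) = exp(mu*t + sigma^2 * t^2 / 2)
mgf_numeric <- function(t, mu, sigma) {
  # E(e^{tX}) as an integral; the log-density keeps the integrand finite in the tails
  integrand <- function(x) exp(t * x + dnorm(x, mu, sigma, log = TRUE))
  integrate(integrand, lower = -Inf, upper = Inf)$value
}
mgf_formula <- function(t, mu, sigma) exp(mu * t + sigma^2 * t^2 / 2)

t0 <- 0.1
m_num <- mgf_numeric(t0, mu = 0, sigma = 4)
m_fml <- mgf_formula(t0, mu = 0, sigma = 4)   # equals exp(8 * t0^2) for N(0, 4)
c(numeric = m_num, formula = m_fml)
```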

It was established earlier in this section that if X ~ N(μ, σ) and Z = (X − μ)/σ, then Z ~ N(0, 1),
and vice versa. This standardizing transformation is actually a special case of a much more general
property.


PROPOSITION Let X ~ N(μ, σ). Then for any constants a and b with a ≠ 0, aX + b is also
normally distributed. That is, any linear rescaling of a normal rv is normal.


The proof of this proposition uses mgfs and is left as an exercise (Exercise 70). This proposition
provides a much easier proof of the earlier relationship between X and Z. The rescaling formulas and
this proposition combine to give the following statement: if X is normally distributed and
Y = aX + b (a ≠ 0), then Y is also normal, with mean μ_Y = aμ_X + b and standard deviation
σ_Y = |a|σ_X.
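The rescaling statement can be checked numerically by computing P(Y ≤ y) in two ways: from the normal distribution claimed for Y, and by translating the event back to X. All numbers below are hypothetical, chosen only for illustration, with a negative a to show why the absolute value appears in the standard deviation.

```r
# Y = a*X + b with X ~ N(mu, sigma): two routes to P(Y <= y)
mu <- 100; sigma <- 15      # hypothetical parameters for X
a <- -2; b <- 10            # hypothetical rescaling constants (a != 0)
y <- -170

mu_Y <- a * mu + b          # mean of Y
sigma_Y <- abs(a) * sigma   # standard deviation of Y

p_from_Y <- pnorm(y, mu_Y, sigma_Y)
# Because a < 0, the event {aX + b <= y} is the event {X >= (y - b)/a}
p_from_X <- 1 - pnorm((y - b) / a, mu, sigma)
c(p_from_Y, p_from_X)
```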


The Normal Distribution and Discrete Populations 

The normal distribution is often used as an approximation to the distribution of values in a discrete 
population. In such situations, extra care must be taken to ensure that probabilities are computed in an 
accurate manner. 


Example 4.26 IQ (as measured by a standard test) is known to be approximately normally distributed
with μ = 100 and σ = 15. What is the probability that a randomly selected individual has an
IQ of at least 125? Letting X = the IQ of a randomly chosen person, we wish P(X ≥ 125). The
temptation here is to standardize X ≥ 125 immediately as in previous examples. However, the IQ
population is actually discrete, since IQs are integer-valued. So, the normal curve is an approximation
to a discrete probability histogram, as pictured in Figure 4.24.

Figure 4.24 A normal approximation to a discrete distribution

The rectangles of the histogram are centered at integers, so IQs of at least 125 correspond to
rectangles beginning at 124.5, as shaded in Figure 4.24. Thus we really want the area under the
approximating normal curve to the right of 124.5. Standardizing this value gives P(Z ≥ 1.63) =
.0516. If we had standardized X ≥ 125, we would have obtained P(Z ≥ 1.67) = .0475. The difference
is not great, but the answer .0516 is more accurate. Similarly, P(X = 125) would be
approximated by the area between 124.5 and 125.5, since the area under the normal curve above the
single value 125 is zero. ■
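The effect of the half-unit adjustment in Example 4.26 can be seen by computing both versions in R. A minimal sketch with illustrative variable names:

```r
mu <- 100; sigma <- 15
p_corrected   <- 1 - pnorm(124.5, mu, sigma)  # area to the right of 124.5
p_uncorrected <- 1 - pnorm(125, mu, sigma)    # area to the right of 125
# Approximation to P(X = 125): area over the rectangle from 124.5 to 125.5
p_exactly_125 <- pnorm(125.5, mu, sigma) - pnorm(124.5, mu, sigma)
round(c(p_corrected, p_uncorrected, p_exactly_125), 4)
```

These values differ slightly from .0516 and .0475 because software does not round z to two decimal places before evaluating Φ.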


The correction for discreteness of the underlying distribution in Example 4.26 is often called a
continuity correction. It is useful in the following application of the normal distribution to the
computation of binomial probabilities. The normal distribution was actually created as an approximation
to the binomial distribution (by Abraham de Moivre in the 1730s).


Approximating the Binomial Distribution 
Recall that the mean value and standard deviation of a binomial random variable X are μ = np and
σ = √(npq), respectively. Figure 4.25a displays a probability histogram for the binomial
distribution with n = 20, p = .6 [so μ = 20(.6) = 12 and σ = √(20(.6)(.4)) = 2.19]. A normal curve
with mean value and standard deviation equal to the corresponding values for the binomial distribution
has been superimposed on the probability histogram. Although the probability histogram is a
bit skewed (because p ≠ .5), the normal curve gives a very good approximation, especially in the
middle part of the picture. The area of any rectangle (probability of any particular X value) except
those in the extreme tails can be accurately approximated by the corresponding normal curve area.
Thus

\[ P(X = 10) = \binom{20}{10}(.6)^{10}(.4)^{10} = .117, \]

whereas the area under the normal curve between 9.5 and 10.5 is P(−1.14 ≤ Z ≤ −.68) = .120.

On the other hand, a normal distribution is a poor approximation to a discrete distribution that is
heavily skewed. For example, Figure 4.25b shows a probability histogram for the Bin(20, .1) distribution
and the normal pdf with the same mean and standard deviation (μ = 2 and σ = 1.34).
Clearly, we would not want to use this normal curve to estimate binomial probabilities, even with a
continuity correction.
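The comparison just described can be reproduced with R's binomial pmf dbinom alongside pnorm. This is a sketch under our own variable names; the last quantity is one illustrative symptom of the poor fit for Bin(20, .1), namely the area the matching normal curve places below −.5, where no binomial value can occur.

```r
n <- 20

# Bin(20, .6): exact P(X = 10) versus normal area over [9.5, 10.5]
p <- 0.6
mu <- n * p; sigma <- sqrt(n * p * (1 - p))
exact_10  <- dbinom(10, n, p)
normal_10 <- pnorm(10.5, mu, sigma) - pnorm(9.5, mu, sigma)
round(c(exact = exact_10, normal = normal_10), 3)

# Bin(20, .1): area of the matching normal curve below -0.5
p2 <- 0.1
mu2 <- n * p2; sigma2 <- sqrt(n * p2 * (1 - p2))
wasted_area <- pnorm(-0.5, mu2, sigma2)
wasted_area
```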


Figure 4.25 Binomial probability histograms with normal approximation curves superimposed:
(a) n = 20 and p = .6 (a good fit; normal curve with μ = 12, σ = 2.19); (b) n = 20 and p = .1 (a poor fit; normal curve with μ = 2, σ = 1.34)


PROPOSITION Let X be a binomial rv based on n trials with success probability p. Then if the
binomial probability histogram is not too skewed, X has approximately a normal
distribution with μ = np and σ = √(npq). In particular, for x = a possible value of
X,

\[ P(X \le x) = B(x; n, p) \approx (\text{area under the normal curve to the left of } x + .5) = \Phi\left(\frac{x + .5 - np}{\sqrt{npq}}\right) \]

In practice, the approximation is adequate provided that both np ≥ 10 and
nq ≥ 10.


If either np < 10 or nq < 10, the binomial distribution may be too skewed for the (symmetric) normal
curve to give accurate approximations.


Example 4.27 Suppose that 25% of all licensed drivers in a state do not have insurance. Let X be the
number of uninsured drivers in a random sample of size 50 (somewhat perversely, a success is an
uninsured driver), so that p = .25. Then μ = 12.5 and σ = 3.062. Since np = 50(.25) = 12.5 ≥ 10
and nq = 37.5 ≥ 10, the approximation can safely be applied:

\[ P(X \le 10) = B(10; 50, .25) \approx \Phi\left(\frac{10 + .5 - 12.5}{3.062}\right) = \Phi(-.65) = .2578 \]

Similarly, the probability that between 5 and 15 (inclusive) of the selected drivers are uninsured is

\[ P(5 \le X \le 15) = B(15; 50, .25) - B(4; 50, .25) \approx \Phi\left(\frac{15.5 - 12.5}{3.062}\right) - \Phi\left(\frac{4.5 - 12.5}{3.062}\right) = .8320 \]

The exact probabilities are .2622 and .8348, respectively, so the approximations are quite good. In the
last calculation, the probability P(5 ≤ X ≤ 15) is being approximated by the area under the normal
curve between 4.5 and 15.5; the continuity correction is used for both the upper and lower limits. ■

The wide availability of software for doing binomial probability calculations, even for large values
of n, has considerably diminished the importance of the normal approximation. However, it is
important for another reason. When the objective of an investigation is to make an inference about a
population proportion p, interest will focus on the sample proportion of successes P̂ = X/n rather than
on X itself. Because this proportion is just X multiplied by the constant 1/n, the earlier rescaling
proposition tells us that P̂ will also have approximately a normal distribution (with mean μ_P̂ = p and
standard deviation σ_P̂ = √(pq/n)) provided that both np ≥ 10 and nq ≥ 10. This normal approximation
is the basis for several inferential procedures to be discussed in later chapters.

It is quite difficult to give a direct proof of the validity of this normal approximation (the first one 
goes back almost 300 years to de Moivre). In Chapter 6, we’ll see that it is a consequence of an 
important general result called the Central Limit Theorem. 
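Since software makes exact binomial calculations easy, the quality of the approximation in Example 4.27 can be examined directly. The sketch below uses R's binomial cdf pbinom together with pnorm; the variable names are ours.

```r
n <- 50; p <- 0.25
mu <- n * p; sigma <- sqrt(n * p * (1 - p))

exact_a  <- pbinom(10, n, p)                      # P(X <= 10)
approx_a <- pnorm(10.5, mu, sigma)                # with continuity correction
exact_b  <- pbinom(15, n, p) - pbinom(4, n, p)    # P(5 <= X <= 15)
approx_b <- pnorm(15.5, mu, sigma) - pnorm(4.5, mu, sigma)

round(rbind(exact = c(exact_a, exact_b),
            approx = c(approx_a, approx_b)), 4)
```

The approximate values can differ from the hand calculation in the third decimal place, because the hand calculation rounds z before consulting the z table.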


Normal Distribution Calculations with Software 

Many software packages, including R, have built-in functions to determine both probabilities under a 
normal curve and quantiles (aka percentiles) of any given normal distribution. Table 4.2 summarizes 
the relevant R code. 


Table 4.2 Normal probability and quantile calculations in R


Function    cdf                 Quantile; i.e., the (100p)th percentile
Notation    Φ((x − μ)/σ)        η_p = μ + Φ⁻¹(p)·σ
R           pnorm(x, μ, σ)      qnorm(p, μ, σ)


In the special case of a standard normal distribution, R will allow the user to drop the last two
arguments, μ and σ. That is, the R commands pnorm(x) and pnorm(x,0,1) yield the same result
for any number x, and a similar comment applies to qnorm. R also has a built-in function for the
normal pdf: dnorm(x, μ, σ). However, this function is generally only used when one desires to
graph a normal density curve, x vs. f(x; μ, σ), since the pdf evaluated at a particular x does not represent
a probability (as discussed in Section 4.1).
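The point about dnorm can be made concrete with a short sketch: density values are heights of the curve and may exceed 1, while areas under the curve behave like probabilities. The grid spacing and the choice σ = .1 below are arbitrary illustrations.

```r
# Density values are heights of the curve, not probabilities
x <- seq(-4, 4, by = 0.01)
fx <- dnorm(x)                 # standard normal density on a grid
# plot(x, fx, type = "l")      # uncomment to draw the z curve

peak_narrow <- dnorm(0, mean = 0, sd = 0.1)  # height at the mean when sigma = .1
peak_narrow                                  # larger than 1
area_approx <- sum(fx) * 0.01                # Riemann sum of the density over [-4, 4]
area_approx                                  # close to 1
```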


Exercises: Section 4.3 (39-70) 


39. Let Z be a standard normal random variable and calculate the following probabilities,
drawing pictures wherever appropriate.

a. P(0 ≤ Z ≤ 2.17)
b. P(0 ≤ Z ≤ 1)
c. P(−2.50 ≤ Z ≤ 0)
d. P(−2.50 ≤ Z ≤ 2.50)
e. P(Z ≤ 1.37)
f. P(−1.75 ≤ Z)
g. P(−1.50 ≤ Z ≤ 2.00)
h. P(1.37 ≤ Z ≤ 2.50)
i. P(1.50 ≤ Z)
j. P(|Z| ≤ 2.50)

40. In each case, determine the value of the constant c that makes the probability
statement correct.

a. Φ(c) = .9838
b. P(0 ≤ Z ≤ c) = .291
c. P(c ≤ Z) = .121
d. P(−c ≤ Z ≤ c) = .668
e. P(c ≤ |Z|) = .016

41. Find the following percentiles for the standard normal distribution. Interpolate
where appropriate.

a. 91st
b. 9th
c. 75th
d. 25th
e. 6th

42. Determine z_α for the following:

a. α = .0055
b. α = .09
c. α = .663

43. If X is a normal rv with mean 80 and standard deviation 10, compute the following
probabilities by standardizing:

a. P(X ≤ 100)
b. P(X ≤ 80)
c. P(65 ≤ X ≤ 100)
d. P(70 ≤ X)
e. P(85 ≤ X ≤ 95)
f. P(|X − 80| ≤ 10)

44. The plasma cholesterol level (mg/dL) for patients with no prior evidence of heart
disease who experience chest pain is normally distributed with mean 200 and standard
deviation 35. Consider randomly selecting an individual of this type. What is
the probability that the plasma cholesterol level

a. Is at most 250?
b. Is between 300 and 400?
c. Differs from the mean by at least 1.5 standard deviations?

45. The article “Reliability of Domestic-Waste Biofilm Reactors” (J. Envir. Engr. 1995:
785-790) suggests that substrate concentration (mg/cm³) of influent to a reactor is
normally distributed with μ = .30 and σ = .06.

a. What is the probability that the concentration exceeds .25?
b. What is the probability that the concentration is at most .10?
c. How would you characterize the largest 5% of all concentration values?

46. Suppose the diameter at breast height (in.) of trees of a certain type is normally
distributed with μ = 8.8 and σ = 2.8, as suggested in the article “Simulating a
Harvester-Forwarder Softwood Thinning” (Forest Products J., May 1997: 36-41).

a. What is the probability that the diameter of a randomly selected tree will be at
least 10 in.? Will exceed 10 in.?
b. What is the probability that the diameter of a randomly selected tree will exceed
20 in.?
c. What is the probability that the diameter of a randomly selected tree will be
between 5 and 10 in.?
d. What value c is such that the interval (8.8 − c, 8.8 + c) includes 98% of all
diameter values?
e. If four trees are independently selected, what is the probability that at least one
has a diameter exceeding 10 in.?

47. There are two machines available for cutting corks intended for use in wine bottles.
The first produces corks with diameters that are normally distributed with mean 3 cm
and standard deviation .1 cm. The second machine produces corks with diameters that
have a normal distribution with mean 3.04 cm and standard deviation .02 cm.
Acceptable corks have diameters between 2.9 and 3.1 cm. Which machine is more
likely to produce an acceptable cork?

48. Human body temperatures for healthy individuals have approximately a normal
distribution with mean 98.25 °F and standard deviation .75 °F. (The past accepted
value of 98.6 °F was obtained by converting the Celsius value of 37°, which is correct
to the nearest integer.)

a. Find the 90th percentile of the distribution.
b. Find the 5th percentile of the distribution.
c. What temperature separates the coolest 25% from the others?

49. The article “Monte Carlo Simulation—Tool for Better Understanding of LRFD”
(J. Struct. Engr. 1993: 1586-1599) suggests that yield strength (ksi) for A36 grade
steel is normally distributed with μ = 43 and σ = 4.5.

a. What is the probability that yield strength is at most 40? Greater than 60?
b. What yield strength value separates the strongest 75% from the others?

50. The automatic opening device of a military cargo parachute has been designed to open
when the parachute is 200 m above the ground. Suppose opening altitude actually
has a normal distribution with mean value 200 m and standard deviation 30 m.
Equipment damage will occur if the parachute opens at an altitude of less than
100 m. What is the probability that there is equipment damage to the payload of
at least 1 of 5 independently dropped parachutes?

51. The temperature reading from a thermocouple placed in a constant temperature
medium is normally distributed with mean μ, the actual temperature of the medium,
and standard deviation σ. What would the value of σ have to be to ensure that 95% of
all readings are within .1° of μ?

52. The distribution of resistance for resistors of a certain type is known to be normal,
with 10% of all resistors having a resistance exceeding 10.256 ohms and 5% having a
resistance smaller than 9.671 ohms. What are the mean value and standard deviation
of the resistance distribution?


53. If adult female heights are normally distributed, what is the probability that the
height of a randomly selected woman is

a. Within 1.5 SDs of its mean value?
b. Farther than 2.5 SDs from its mean value?
c. Between 1 and 2 SDs from its mean value?

54. A machine that produces ball bearings has initially been set so that the true average
diameter of the bearings it produces is .500 in. A bearing is acceptable if its diameter is
within .004 in. of this target value. Suppose, however, that the setting has changed
during the course of production, so that the bearings have normally distributed diameters
with mean value .499 in. and standard deviation .002 in. What percentage of the
bearings produced will not be acceptable?


55. The Rockwell hardness of a metal is determined by impressing a hardened point
into the surface of the metal and then measuring the depth of penetration of the
point. Suppose the Rockwell hardness of an alloy is normally distributed with mean 70
and standard deviation 3. (Rockwell hardness is measured on a continuous scale.)

a. If a specimen is acceptable only if its hardness is between 67 and 75, what is the
probability that a randomly chosen specimen has an acceptable hardness?
b. If the acceptable range of hardness is (70 − c, 70 + c), for what value of
c would 95% of all specimens have acceptable hardness?
c. If the acceptable range is as in part (a) and the hardness of each of ten randomly
selected specimens is independently determined, what is the expected
number of acceptable specimens among the ten?
d. What is the probability that at most 8 of 10 independently selected specimens
have a hardness of less than 73.84? [Hint: Y = the number among the ten
specimens with hardness less than 73.84 is a binomial variable; what is p?]

56. The weight distribution of parcels sent in a certain manner is normal with mean value
12 lb and standard deviation 3.5 lb. The parcel service wishes to establish a weight
value c beyond which there will be a surcharge. What value of c is such that 99% of
all parcels are at least 1 lb under the surcharge weight?

57. Suppose Appendix Table A.3 contained Φ(z) only for z ≥ 0. Explain how you
could still compute

a. P(−1.72 ≤ Z ≤ −.55)
b. P(−1.72 ≤ Z ≤ .55)

Is it necessary to table Φ(z) for z negative? What property of the standard normal
curve justifies your answer?


58. Let X be the birth weight, in grams, of a randomly selected full-term baby. The
article “Fetal Growth Parameters and Birth Weight: Their Relationship to Neonatal
Body Composition” (Ultrasound Obstetrics Gynecol. 2009: 441-446) suggests that X is
normally distributed with mean 3500 and standard deviation 600.

a. Sketch the relevant density curve, including tick marks on the horizontal scale.
b. What is P(3000 < X < 4500), and how does this compare to P(3000 ≤ X ≤ 4500)?
c. What is the probability that the weight of such a newborn is less than 2500 g?
d. What is the probability that the weight of such a newborn exceeds 6000 g
(roughly 13.2 lb)?
e. How would you characterize the most extreme .1% of all birth weights?
f. Use the rescaling proposition from this section to determine the distribution of
birth weight expressed in pounds (shape, mean, and standard deviation), and then
recalculate the probability from part (c). How does this compare to your previous
answer?


59. Based on extensive data from an urban freeway near Toronto, Canada, “it is
assumed that free speeds can best be represented by a normal distribution” [“Impact
of Driver Compliance on the Safety and Operational Impacts of Freeway Variable
Speed Limit Systems” (J. Transp. Engr. 2011: 260-268)]. The mean and standard
deviation reported in the article were 119 km/h and 13.1 km/h, respectively.

a. What is the probability that the speed of a randomly selected vehicle is between
100 and 120 km/h?
b. What speed characterizes the fastest 10% of all speeds?
c. The posted speed limit was 100 km/h. What percentage of vehicles was traveling
at speeds exceeding this posted limit?
d. If five vehicles are randomly and independently selected, what is the probability
that at least one is not exceeding the posted speed limit?
e. What is the probability that the speed of a randomly selected vehicle exceeds 70
miles per hour?


60. Chebyshev’s inequality, introduced in Chapter 3 Exercise 45, is valid for continuous
as well as discrete distributions. It states that for any number k ≥ 1,
P(|X − μ| ≥ kσ) ≤ 1/k² (see the aforementioned exercise for an interpretation
and Chapter 3 Exercise 163 for a proof). Obtain this probability in the case of a
normal distribution for k = 1, 2, and 3, and compare to the Chebyshev upper bound.

61. Let X denote the number of flaws along a 100-m reel of magnetic tape (an integer-valued
variable). Suppose X has approximately a normal distribution with μ = 25
and σ = 5. Use the continuity correction to calculate the probability that the number of
flaws is

a. Between 20 and 30, inclusive.
b. At most 30. Less than 30.

62. Let X have a binomial distribution with parameters n = 25 and p. Calculate each of
the following probabilities using the normal approximation (with the continuity correction)
for the cases p = .5, .6, and .8 and compare to the exact probabilities calculated
from Appendix Table A.1.

a. P(15 ≤ X ≤ 20)
b. P(X ≤ 15)
c. P(20 ≤ X)

63. Suppose that 10% of all steel shafts produced by a process are nonconforming but
can be reworked (rather than having to be scrapped). Consider a random sample of
200 shafts, and let X denote the number among these that are nonconforming and
can be reworked. What is the (approximate) probability that X is

a. At most 30?
b. Less than 30?
c. Between 15 and 25 (inclusive)?

64. Suppose only 70% of all drivers in a state regularly wear a seat belt. A random sample
of 500 drivers is selected. What is the probability that

a. Between 320 and 370 (inclusive) of the drivers in the sample regularly wear a
seat belt?
b. Fewer than 325 of those in the sample regularly wear a seat belt? Fewer than
315?


65. 


66. 
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In response to concerns about nutritional 
contents of fast foods, McDonald’s 


announced that it would use a new cooking 
oil for its French fries that would decrease 
substantially trans-fatty acid levels and 
increase the amount of more beneficial 
polyunsaturated fat. The company claimed 
that 97 out of 100 people cannot detect a 
difference in taste between the new and old 
oils. Assuming that this figure is correct (as 
a long-run proportion), what is the 
approximate probability that in a random 
sample of 1000 individuals who have pur- 
chased fries at McDonald’s, 


a. At least 40 can taste the difference 
between the two oils? 

b. At most 5% can taste the difference 
between the two oils? 

66. The following proof that the normal pdf
integrates to 1 comes courtesy of Professor
Robert Young, Oberlin College. Let f(z)
denote the standard normal pdf, and consider
the function of two variables

$$g(x, y) = f(x) \cdot f(y) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-y^2/2} = \frac{1}{2\pi} e^{-(x^2 + y^2)/2}$$

Let V denote the volume under the graph of
g(x, y) above the xy-plane.

a. Let A denote the area under the standard
normal curve. By setting up the double
integral for the volume underneath g(x, y),
show that V = A².

b. Using the rotational symmetry of g(x, y),
V can be determined by adding up the
volumes of shells from rotation about
the y-axis:

$$V = \int_0^{\infty} 2\pi r \cdot \frac{1}{2\pi} e^{-r^2/2} \, dr$$

Show this integral equals 1, then use (a) to
establish that the area under the standard
normal curve is 1.

c. Show that $\int_{-\infty}^{\infty} f(x; \mu, \sigma)\,dx = 1$. [Hint:
Write out the integral, and then make a
substitution to reduce it to the standard
normal case. Then invoke (b).]

67. Suppose X ~ N(μ, σ).

a. Show via integration that E(X) = μ.
[Hint: Make the substitution
u = (x − μ)/σ, which will create two
integrals. For one, use the symmetry of
the pdf; for the other, use the fact that
the standard normal pdf integrates to 1.]

b. Show via integration that V(X) = σ².
[Hint: Evaluate the integral for
E[(X − μ)²] rather than using the variance
shortcut formula. Use the same
substitution as in part (a).]

68. The moment generating function can be
used to find the mean and variance of the
normal distribution.

a. Use derivatives of $M_X(t)$ to verify that
E(X) = μ and V(X) = σ².

b. Repeat (a) using $R_X(t) = \ln[M_X(t)]$, and
compare with part (a) in terms of effort.

69. There is no nice formula for the standard
normal cdf Φ(z), but several good
approximations have been published in
articles. The following is from “Approximations
for Hand Calculators Using Small Integer
Coefficients” (Math. Comput. 1977: 214-222).
For 0 < z ≤ 5.5,

$$P(Z \ge z) = 1 - \Phi(z) \approx .5 \exp\left\{ -\left[ \frac{(83z + 351)z + 562}{(703/z) + 165} \right] \right\}$$

The relative error of this approximation is
less than .042%. Use this to calculate
approximations to the following probabilities,
and compare whenever possible to the
probabilities obtained from Appendix
Table A.3.

a. P(Z ≥ 1)
b. P(Z < −3)
c. P(−4 < Z < 4)
d. P(Z > 5)


70. a. Use mgfs to show that if X has a normal
distribution with parameters $\mu_X$ and $\sigma_X$,
then Y = aX + b (a linear function of
X) also has a normal distribution. What
are the parameters of the distribution of
Y [i.e., $\mu_Y$ and $\sigma_Y$]?

b. If when measured in °C, temperature is
normally distributed with mean 115 and
standard deviation 2, what can be said
about the distribution of temperature
measured in °F?


4.4 The Gamma Distribution and Its Relatives 


The graph of any normal pdf is bell-shaped and thus symmetric. But in many situations, the variable 
of interest to the experimenter might have a skewed distribution. A family of pdfs that yields a wide 
variety of skewed distributional shapes is the gamma family. To define the family of gamma dis- 
tributions, we first need to introduce a function that plays an important role in many branches of 
mathematics. 
DEFINITION For α > 0, the gamma function Γ(α) is defined by

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x} \, dx$$


The most important properties of the gamma function are the following:
1. For any α > 1, Γ(α) = (α − 1) · Γ(α − 1) (via integration by parts)
2. For any positive integer n, Γ(n) = (n − 1)!
3. Γ(1/2) = √π


The following proposition will prove useful for several computations that follow. 


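PROPOSITION For any α, β > 0,

$$\int_0^{\infty} x^{\alpha - 1} e^{-x/\beta} \, dx = \beta^{\alpha} \Gamma(\alpha) \qquad (4.5)$$


Proof Make the substitution u = x/β, so that x = βu and dx = β du:

$$\int_0^{\infty} x^{\alpha - 1} e^{-x/\beta} \, dx = \int_0^{\infty} (\beta u)^{\alpha - 1} e^{-u} \beta \, du = \beta^{\alpha} \int_0^{\infty} u^{\alpha - 1} e^{-u} \, du = \beta^{\alpha} \Gamma(\alpha)$$

The last equality comes from the definition of the gamma function. ■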


The Family of Gamma Distributions 
With the preceding proposition in mind, we make the following definition. 


DEFINITION A continuous random variable X is said to have a gamma distribution
if the pdf of X is

$$f(x; \alpha, \beta) = \frac{1}{\beta^{\alpha} \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta} \qquad x > 0 \qquad (4.6)$$

where the parameters α and β satisfy α > 0, β > 0. When β = 1, X is said to
have a standard gamma distribution, and its pdf may be denoted f(x; α).


It’s clear that f(x; α, β) ≥ 0 for all x; the previous proposition guarantees that this function integrates
to 1, as required. Figure 4.26a illustrates the graphs of the gamma pdf for several (α, β) pairs, whereas
Figure 4.26b presents graphs of the standard gamma pdf. For the standard pdf, when α ≤ 1, f(x; α) is
strictly decreasing as x increases; when α > 1, f(x; α) rises to a maximum and then decreases. The
parameter β in (4.6) is called a scale parameter because values other than 1 either stretch or compress
the pdf in the x direction.


Figure 4.26 (a) Gamma density curves; (b) standard gamma density curves

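PROPOSITION The moment generating function of a gamma random variable is

$$M_X(t) = \frac{1}{(1 - \beta t)^{\alpha}} \qquad t < 1/\beta$$

Proof By definition, the mgf is

$$M_X(t) = E(e^{tX}) = \int_0^{\infty} e^{tx} \cdot \frac{x^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}} e^{-x/\beta} \, dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha - 1} e^{-x(-t + 1/\beta)} \, dx$$

Now use Expression (4.5), with 1/(−t + 1/β) playing the role of β: provided −t + 1/β > 0, i.e., t < 1/β,

$$\frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha - 1} e^{-x(-t + 1/\beta)} \, dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \cdot \left( \frac{1}{-t + 1/\beta} \right)^{\alpha} \Gamma(\alpha) = \frac{1}{(1 - \beta t)^{\alpha}}$$ ■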


The mean and variance can be obtained from the moment generating function (Exercise 82), but they 
can also be obtained directly through integration (Exercise 83). 


PROPOSITION The mean and variance of a random variable X having the gamma
distribution f(x; α, β) are

$$E(X) = \mu = \alpha\beta \qquad V(X) = \sigma^2 = \alpha\beta^2$$


When X is a standard gamma rv, the cdf of X, which is

$$G(x; \alpha) = \int_0^{x} \frac{y^{\alpha - 1} e^{-y}}{\Gamma(\alpha)} \, dy \qquad x > 0 \qquad (4.7)$$

is called the incomplete gamma function. (In mathematics literature, the incomplete gamma function
sometimes refers to (4.7) without the denominator Γ(α) in the integrand.) In Appendix Table A.4, we
present a small tabulation of G(x; α) for α = 1, 2, ..., 10 and x = 1, 2, ..., 15. Table 4.3
provides the R commands related to the gamma cdf, which are illustrated in the following examples.


Example 4.28 Suppose the reaction time X (in seconds) of a randomly selected individual to a
certain stimulus has a standard gamma distribution with α = 2. Since X is continuous,

$$P(3 \le X \le 5) = P(X \le 5) - P(X \le 3) = G(5; 2) - G(3; 2) = .960 - .801 = .159$$

This probability can be obtained in R with pgamma(5,2) - pgamma(3,2).
The probability that the reaction time is more than 4 s is

$$P(X > 4) = 1 - P(X \le 4) = 1 - G(4; 2) = 1 - .908 = .092$$ ■
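
The two probabilities in Example 4.28 can also be checked numerically. The following is a minimal Python sketch, not part of the original text: the helper name `std_gamma_cdf` is hypothetical, and it approximates the integral in (4.7) with composite Simpson's rule, assuming α ≥ 1 so that the integrand stays bounded at 0.

```python
import math

def std_gamma_cdf(x, alpha, steps=2000):
    """Approximate G(x; alpha) from (4.7) by composite Simpson's rule.

    Assumes alpha >= 1 (bounded integrand at 0) and an even number of steps.
    """
    if x <= 0:
        return 0.0
    h = x / steps

    def integrand(y):
        return y ** (alpha - 1) * math.exp(-y)

    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        # Simpson weights: 4 at odd interior nodes, 2 at even interior nodes
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3 / math.gamma(alpha)

# Example 4.28: standard gamma with alpha = 2
p_between = std_gamma_cdf(5, 2) - std_gamma_cdf(3, 2)   # P(3 <= X <= 5)
p_more_than_4 = 1 - std_gamma_cdf(4, 2)                 # P(X > 4)
print(round(p_between, 3), round(p_more_than_4, 3))
```

Rounded to three decimals, the results should agree with the table-based values .159 and .092.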


The incomplete gamma function can also be used to compute probabilities involving gamma
distributions for any β > 0.


PROPOSITION Let X have a gamma distribution with parameters α and β. Then for
any x > 0, the cdf of X is given by

$$P(X \le x) = G\left( \frac{x}{\beta}; \alpha \right),$$

the incomplete gamma function evaluated at x/β.


The proof is similar to that of Expression (4.5). 


Example 4.29 Web servers typically have security algorithms that detect and flag “abnormal”
connections from suspicious IP addresses, which can indicate possible hackers. Data from the article
“Exact Inferences for a Gamma Distribution” (J. Quality Technol. 2014: 140-149) suggests that, for
one particular server receiving abnormal connections from one specific IP address, the time X in hours
between attempted connections can be modeled using a gamma distribution with α = 2 and β = 2.5.
(In fact, the article provides a range of estimates for the parameters; we’ll encounter such interval
estimates in Chapter 8.) The average time between connections from this suspicious IP address is
E(X) = (2)(2.5) = 5 h, whereas V(X) = (2)(2.5)² = 12.5 and $\sigma_X = \sqrt{12.5} \approx 3.5$ h. The probability
that a connection from this suspicious IP address will arrive between 5 and 10 h after the previous
attempt is

$$\begin{aligned} P(5 \le X \le 10) &= P(X \le 10) - P(X \le 5) \\ &= G(10/2.5; 2) - G(5/2.5; 2) \\ &= G(4; 2) - G(2; 2) = .908 - .594 = .314 \end{aligned}$$

The probability that two connection attempts from this IP address are separated by more than 15 h is

$$P(X > 15) = 1 - P(X \le 15) = 1 - G(15/2.5; 2) = 1 - G(6; 2) = 1 - .983 = .017$$

Software can also perform these calculations. For instance, the R commands
pgamma(10,2,1/2.5)-pgamma(5,2,1/2.5) and 1-pgamma(15,2,1/2.5)
compute the two probabilities above and return .3144 and .0174, respectively. ■
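
As a cross-check on Example 4.29, here is a small Python sketch of the scaling rule P(X ≤ x) = G(x/β; α). It is an illustration only: the function name is hypothetical, and it relies on a standard closed form for integer shape n (obtained by repeated integration by parts, and not derived in this passage), G(u; n) = 1 − Σ_{k=0}^{n−1} e^{−u}u^k/k!.

```python
import math

def gamma_cdf_integer_shape(x, n, beta=1.0):
    """P(X <= x) for a gamma rv with integer shape n and scale beta.

    Uses G(u; n) = 1 - sum_{k=0}^{n-1} e^{-u} u^k / k!  with u = x / beta.
    Valid only when the shape parameter is a positive integer.
    """
    if x <= 0:
        return 0.0
    u = x / beta  # rescale to the standard gamma, as in the proposition
    tail = sum(math.exp(-u) * u ** k / math.factorial(k) for k in range(n))
    return 1.0 - tail

# Example 4.29: alpha = 2, beta = 2.5
p_5_to_10 = gamma_cdf_integer_shape(10, 2, 2.5) - gamma_cdf_integer_shape(5, 2, 2.5)
p_over_15 = 1 - gamma_cdf_integer_shape(15, 2, 2.5)
print(round(p_5_to_10, 4), round(p_over_15, 4))
```

The printed values should match the R results quoted in the example (.3144 and .0174).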

The Exponential Distribution


The family of exponential distributions provides probability models that are widely used in engi- 
neering and science disciplines. 


DEFINITION X is said to have an exponential distribution with parameter λ (λ > 0)
if the pdf of X is

$$f(x; \lambda) = \lambda e^{-\lambda x} \qquad x > 0 \qquad (4.8)$$


The exponential pdf is a special case of the general gamma pdf (4.6) in which α = 1 and β = 1/λ;
some sources write the exponential pdf in the form $(1/\beta)e^{-x/\beta}$. The mean and variance of X are then

$$\mu = \alpha\beta = \frac{1}{\lambda} \qquad \sigma^2 = \alpha\beta^2 = \frac{1}{\lambda^2}$$

Both the mean and standard deviation of the exponential distribution equal 1/λ. Graphs of several
exponential pdfs appear in Figure 4.27.

Figure 4.27 Exponential density curves

Unlike the general gamma pdf, the exponential pdf can be easily integrated. In particular, the cdf
of X is

$$F(x; \lambda) = \begin{cases} 0 & x < 0 \\ 1 - e^{-\lambda x} & x \ge 0 \end{cases}$$


Example 4.30 The response time X at an online computer terminal (the elapsed time between the
end of a user’s inquiry and the beginning of the system’s response to that inquiry) has an exponential
distribution with expected response time equal to 5 s. Then E(X) = 1/λ = 5, so λ = .2. The probability
that the response time is at most 10 s is

$$P(X \le 10) = F(10; .2) = 1 - e^{-(.2)(10)} = 1 - e^{-2} = 1 - .135 = .865$$

The probability that response time is between 5 and 10 s is

$$P(5 \le X \le 10) = F(10; .2) - F(5; .2) = (1 - e^{-2}) - (1 - e^{-1}) = .233$$ ■
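
The closed-form cdf makes Example 4.30 easy to reproduce. The sketch below is a minimal Python illustration; the function name `exp_cdf` is an assumption made for this example, not notation from the text.

```python
import math

def exp_cdf(x, lam):
    """cdf of an exponential rv with rate lam: 0 for x < 0, 1 - e^(-lam x) otherwise."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

# Example 4.30: expected response time 5 s, so lam = 1/5 = .2
lam = 1 / 5
p_at_most_10 = exp_cdf(10, lam)                   # P(X <= 10)
p_5_to_10 = exp_cdf(10, lam) - exp_cdf(5, lam)    # P(5 <= X <= 10)
print(round(p_at_most_10, 3), round(p_5_to_10, 3))
```

In R the same quantities are given by pexp(10, .2) and pexp(10, .2) - pexp(5, .2).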


The exponential distribution is frequently used as a model for the distribution of times between the 
occurrence of successive events, such as customers arriving at a service facility or calls coming into a 
switchboard. The reason for this is that the exponential distribution is closely related to the Poisson 
process discussed in Chapter 3. 


THEOREM Suppose that the number of events occurring in any time interval of length t has a
Poisson distribution with parameter μ = λt (where λ, the rate of the event process, is
the expected number of events occurring in 1 unit of time) and that numbers of
occurrences in nonoverlapping intervals are independent of one another. Then the
distribution of elapsed time between the occurrence of two successive events is
exponential with parameter λ.


Although a complete proof is beyond the scope of the text, the result is easily verified for the time X₁
until the first event occurs:

$$P(X_1 \le t) = 1 - P(X_1 > t) = 1 - P(\text{no events in } (0, t]) = 1 - \frac{e^{-\lambda t} \cdot (\lambda t)^0}{0!} = 1 - e^{-\lambda t}$$


which is exactly the cdf of the exponential distribution. 


Example 4.31 Video-on-demand services must carefully model customers’ or clients’ requests for
videos to optimize the use of the available bandwidth. The article “Distributed Client-Assisted
Patching for Multicast Video-on-Demand Service in an Enterprise Network” (J. Comput. 2017:
511-520) describes a series of experiments in this area, where client requests are modeled by a
Poisson process. In one such experiment, the “request rate” was λ = 0.8 requests per second. Then
the time X between successive requests has an exponential distribution with parameter value 0.8.
The probability that more than 2 s elapse between requests is

$$P(X > 2) = 1 - P(X \le 2) = 1 - F(2; 0.8) = e^{-(0.8)(2)} = .202$$

The average time between requests under this setting is E(X) = 1/λ = 1/0.8 = 1.25 s (you could also
deduce this directly from the rate without using the exponential model). ■


Another important application of the exponential distribution is to model the distribution of
component lifetime. A partial reason for the popularity of such applications is the “memoryless”
property of the exponential distribution. Suppose component lifetime is exponentially distributed
with parameter λ. After putting the component into service, we leave for a period of t₀ h and then
return to find the component still working; what now is the probability that it lasts at least an
additional t hours? In symbols, we wish P(X ≥ t + t₀ | X ≥ t₀). By the definition of conditional
probability,

$$P(X \ge t + t_0 \mid X \ge t_0) = \frac{P[(X \ge t + t_0) \cap (X \ge t_0)]}{P(X \ge t_0)}$$

But the event X ≥ t₀ in the numerator is redundant, since both events can occur if and only if
X ≥ t + t₀. Therefore,

$$P(X \ge t + t_0 \mid X \ge t_0) = \frac{P(X \ge t + t_0)}{P(X \ge t_0)} = \frac{1 - F(t + t_0; \lambda)}{1 - F(t_0; \lambda)} = \frac{e^{-\lambda(t + t_0)}}{e^{-\lambda t_0}} = e^{-\lambda t}$$

This conditional probability is identical to the original probability P(X ≥ t) that the component
lasted t hours. Thus the distribution of additional lifetime is exactly the same as the original
distribution of lifetime, so at each point in time the component shows no effect of wear. In other words,
the distribution of remaining lifetime is independent of current age.
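
The memoryless identity is easy to confirm numerically. The Python sketch below uses hypothetical values of λ, t, and t₀ chosen purely for illustration; any positive values should behave the same way.

```python
import math

def exp_survival(x, lam):
    """P(X >= x) = e^(-lam x) for an exponential rv with rate lam and x >= 0."""
    return math.exp(-lam * x)

# Hypothetical illustration values (not from the text)
lam, t, t0 = 0.2, 3.0, 7.0

# P(X >= t + t0 | X >= t0) = P(X >= t + t0) / P(X >= t0)
conditional = exp_survival(t + t0, lam) / exp_survival(t0, lam)
# P(X >= t) for a brand-new component
unconditional = exp_survival(t, lam)

print(conditional, unconditional)  # the two values agree up to rounding error
```

Changing t0 leaves `conditional` unchanged, which is the "no effect of wear" statement in numerical form.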

Although the memoryless property can be justified at least approximately in many applied 
problems, in other situations components deteriorate with age or occasionally improve with age (at 
least up to a certain point). More general lifetime models are then furnished by the gamma, Weibull, 
and lognormal distributions (the latter two are discussed in the next section). 


Gamma and Related Calculations with Software 

Table 4.3 summarizes the syntax for the gamma and exponential cdfs in R, which follows the pattern
of the other distributions. In a sense, the exponential commands are redundant, since they are just a
special case (α = 1) of the gamma distribution.


Table 4.3 R code for gamma and exponential cdfs

            Gamma cdf              Exponential cdf
Notation    G(x/β; α)              F(x; λ) = 1 − e^(−λx)
R           pgamma(x, α, 1/β)      pexp(x, λ)

Notice how R parameterizes the distributions: for both the gamma and exponential cdfs, the R
functions take as their last input the “rate” parameter λ = 1/β. So, for the gamma rv with parameters
α = 2 and β = 2.5 from Example 4.29, P(X ≤ 15) would be evaluated as pgamma(15,2,1/2.5).
The reciprocal can be avoided by using a name assignment in the last argument in R; specifically,
pgamma(15,2,scale=2.5) will instruct R to use β = 2.5 in its gamma probability calculation
and produce the same answer as the previous expressions. Interestingly, as of this writing the same
option does not exist in the pexp function.

To graph gamma or exponential density curves, one can request their pdfs in R by replacing the
leading letter p with d. To find quantiles of either of these distributions, the appropriate replacement
is q. For example, the 75th percentile of the gamma distribution from Example 4.29 can be
determined with qgamma(.75,2,scale = 2.5).
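
To show what a quantile function such as qgamma computes, here is a hedged Python sketch that inverts a cdf by bisection. It is not how R implements qgamma; the function names are hypothetical, and the cdf used is the closed form for shape α = 2 only, F(x) = 1 − e^(−x/β)(1 + x/β), which follows from integrating the pdf (4.6) by parts.

```python
import math

def gamma2_cdf(x, beta):
    """cdf of a gamma rv with shape alpha = 2 and scale beta (closed form, shape 2 only)."""
    if x <= 0:
        return 0.0
    u = x / beta
    return 1.0 - math.exp(-u) * (1.0 + u)

def quantile_by_bisection(cdf, p, lo=0.0, hi=1.0, tol=1e-10):
    """Invert a continuous, increasing cdf supported on [0, inf) by bisection."""
    while cdf(hi) < p:        # widen the bracket until it contains the quantile
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# 75th percentile of the gamma distribution from Example 4.29 (alpha = 2, beta = 2.5)
q75 = quantile_by_bisection(lambda x: gamma2_cdf(x, 2.5), 0.75)
print(round(q75, 3))
```

Plugging the result back into the cdf should return .75, which is the defining property of the 75th percentile.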


Exercises: Section 4.4 (71-83) 


71. Evaluate the following:

a. Γ(6)
b. Γ(5/2)
c. G(4; 5) (the incomplete gamma function)
d. G(5; 4)
e. G(0; 4)

72. Let X have a standard gamma distribution
with α = 7. Evaluate the following:

a. P(X ≤ 5)
b. P(X < 5)
c. P(X > 8)
d. P(3 ≤ X ≤ 8)
e. P(3 < X < 8)
f. P(X < 4 or X > 6)

73. Suppose the time spent by a randomly
selected student at a campus computer
laboratory has a gamma distribution with
mean 20 min and variance 80 min².

a. What are the values of α and β?
b. What is the probability that a student uses
the laboratory for at most 24 min?
c. What is the probability that a student
spends between 20 and 40 min at the
laboratory?

74. Suppose that when a type of transistor is
subjected to an accelerated life test, the
lifetime X (in weeks) has a gamma
distribution with mean 24 weeks and
standard deviation 12 weeks.

a. What is the probability that a transistor
will last between 12 and 24 weeks?
b. What is the probability that a transistor
will last at most 24 weeks? Is the median
of the lifetime distribution less than
24? Why or why not?
c. What is the 99th percentile of the
lifetime distribution?
d. Suppose the test will actually be
terminated after t weeks. What value of t is
such that only .5% of all transistors
would still be operating at termination?

75. Let X = the time between two successive
arrivals at the drive-up window of a local
bank. If X has an exponential distribution
with λ = 1 (which is identical to a standard
gamma distribution with α = 1), compute
the following:

a. The expected time between two
successive arrivals
b. The standard deviation of the time
between successive arrivals
c. P(X ≤ 4)
d. P(2 ≤ X ≤ 5)

76. Let X denote the distance (m) that an
animal moves from its birth site to the first
territorial vacancy it encounters. Suppose
that for banner-tailed kangaroo rats, X has
an exponential distribution with parameter
λ = .01386 (as suggested in the article
“Competition and Dispersal from Multiple
Nests,” Ecology 1997: 873-883).

a. What is the probability that the distance
is at most 100 m? At most 200 m?
Between 100 and 200 m?
b. What is the probability that distance
exceeds the mean distance by more than
2 standard deviations?
c. What is the value of the median
distance?

77. In studies of anticancer drugs it was found
that if mice are injected with cancer cells,
the survival time can be modeled with the
exponential distribution. Without treatment
the expected survival time was 10 h. What
is the probability that

a. A randomly selected mouse will survive
at least 8 h? At most 12 h? Between 8
and 12 h?
b. The survival time of a mouse exceeds
the mean value by more than 2 standard
deviations? More than 3 standard
deviations?

78. The special case of the gamma distribution
in which α is a positive integer n is called
an Erlang distribution. If we replace β by
1/λ in Expression (4.6), the Erlang pdf is

$$f(x; \lambda, n) = \frac{\lambda (\lambda x)^{n-1} e^{-\lambda x}}{(n-1)!} \qquad x > 0$$

It can be shown that if the times between
successive events are independent, each
with an exponential distribution with
parameter λ, then the total time X that
elapses before all of the next n events occur
has pdf f(x; λ, n).

a. What is the expected value of X? If the
time (in minutes) between arrivals of
successive customers is exponentially
distributed with λ = .5, how much time
can be expected to elapse before the
tenth customer arrives?
b. If customer interarrival time is
exponentially distributed with λ = .5, what is
the probability that the tenth customer
(after the one who has just arrived) will
arrive within the next 30 min?
c. The event {X ≤ t} occurs if and only if
at least n events occur in the next t units
of time. Use the fact that the number of
events occurring in an interval of length
t has a Poisson distribution with mean λt
to write an expression (involving Poisson
probabilities) for the Erlang cumulative
distribution function F(t; λ, n) = P(X ≤ t).

79. A system consists of five identical
components connected in series as shown:

[Figure: components 1, 2, 3, 4, 5 connected in series]

As soon as one component fails, the entire
system will fail. Suppose each component
has a lifetime that is exponentially
distributed with λ = .01 and that components
fail independently of one another. Define
events A_i = {ith component lasts at least
t hours}, i = 1, ..., 5, so that the A_i’s are
independent events. Let X = the time at
which the system fails, that is, the shortest
(minimum) lifetime among the five
components.

a. The event {X ≥ t} is equivalent to what
event involving A_1, ..., A_5?
b. Using the independence of the five A_i’s,
compute P(X ≥ t). Then obtain F(t) =
P(X ≤ t) and the pdf of X. What type of
distribution does X have?
c. Suppose there are n components, each
having exponential lifetime with
parameter λ. What type of distribution
does X have?

80. If X has an exponential distribution with
parameter λ, derive a general expression for
the (100p)th percentile of the distribution.
Then specialize to obtain the median.

81. The article “Numerical Prediction of
Surface Wear and Roughness Parameters
During Running-In for Line Contacts
Under Mixed Lubrication” (J. Tribol., Nov.
2018) proposes probability models for
several variables that arise in studying the
wear of mechanical components like gears
and piston rings. These variables include
X = wear particle thickness (microns) and
W = wear loss (cubic microns).

a. The article’s authors make mathematical
arguments that (1) X should follow an
exponential distribution and (2) the pdfs
of W and X should be related by

$$f_W(w) \propto w^2 \cdot f_X(w)$$

What distribution are the authors
specifying for W? Identify the name and
parameter values of the distribution.
b. If the variance of W is 3.0 (one of
several values considered in the article),
what are the numerical values of the
parameters of W’s distribution, and what
is the value of the λ parameter for X’s
exponential distribution?

82. Determine the mean and variance of the
gamma distribution by differentiating the
moment generating function $M_X(t)$.

83. Determine the mean and variance of the
gamma distribution by first using integration
to obtain E(X) and E(X²). [Hint:
Express the integrand in terms of a gamma
density, and use Expression (4.5).]
4.5 Other Continuous Distributions 


The normal, gamma (including exponential), and uniform families of distributions provide a wide 
variety of probability models for continuous variables, but there are many practical situations in 
which no member of these families fits a set of observed data very well. Statisticians and other 
investigators have developed other families of distributions that are often appropriate in practice. 


The Weibull Distribution 

The family of Weibull distributions was introduced by the Swedish physicist Waloddi Weibull in 
1939; his 1951 article “A Statistical Distribution Function of Wide Applicability” (J. Appl. Mech. 18: 
293-297) discusses a number of applications. 


DEFINITION A random variable X is said to have a Weibull distribution with parameters α
and β (α > 0, β > 0) if the pdf of X is

$$f(x; \alpha, \beta) = \frac{\alpha}{\beta^{\alpha}} x^{\alpha - 1} e^{-(x/\beta)^{\alpha}} \qquad x > 0 \qquad (4.9)$$


In some situations there are theoretical justifications for the appropriateness of the Weibull
distribution, but in many applications f(x; α, β) simply provides a good fit to observed data for particular
values of α and β. When α = 1, the pdf reduces to the exponential distribution (with λ = 1/β), so the
exponential distribution is a special case of both the gamma and Weibull distributions. However,
there are gamma distributions that are not Weibull distributions and vice versa, so one family is not a
subset of the other. Both α and β can be varied to obtain a number of different distributional shapes, as
illustrated in Figure 4.28. Note that β is a scale parameter, so different values stretch or compress the
graph in the x-direction.


Figure 4.28 Weibull density curves [curves shown for α = 1, β = 1 (exponential); α = 2, β = 1; α = 2, β = .5]


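Integrating to obtain E(X) and E(X²) yields the mean and variance of X:

$$\mu = \beta \, \Gamma\left(1 + \frac{1}{\alpha}\right) \qquad \sigma^2 = \beta^2 \left\{ \Gamma\left(1 + \frac{2}{\alpha}\right) - \left[ \Gamma\left(1 + \frac{1}{\alpha}\right) \right]^2 \right\}$$

The computation of μ and σ² thus necessitates using the gamma function from the previous section.
(The mgf of the Weibull distribution is very complicated, and so we do not include it here.) On the
other hand, the integration $\int_0^x f(y; \alpha, \beta)\,dy$ is easily carried out to obtain the cdf of X:

$$F(x; \alpha, \beta) = \begin{cases} 0 & x < 0 \\ 1 - e^{-(x/\beta)^{\alpha}} & x \ge 0 \end{cases} \qquad (4.10)$$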


Example 4.32 One of the most common applications of the Weibull distribution is to model the time
to repair for some item under industrial use. The article “Supply Chain Inventories of Engineered
Shipping Containers” (Intl. J. Manuf. Engr. 2016) discusses modeling the time to repair for highly
engineered reusable shipping containers, which are quite expensive and need to be monitored
carefully. For one specific application, the article suggests using a Weibull distribution with α = 10 and
β = 3.5 (the time to repair, X, is measured in months).

The expected time to repair, variance, and standard deviation are

$$\mu = 3.5 \cdot \Gamma\left(1 + \frac{1}{10}\right) = 3.33 \text{ months}$$

$$\sigma^2 = (3.5)^2 \left\{ \Gamma\left(1 + \frac{2}{10}\right) - \left[ \Gamma\left(1 + \frac{1}{10}\right) \right]^2 \right\} = 0.16 \;\Rightarrow\; \sigma = 0.4 \text{ months}$$

The probability that a shipping container requires repair within the first 3 months is

$$P(X \le 3) = F(3; 10, 3.5) = 1 - e^{-(3/3.5)^{10}} = .193$$

Similarly, P(2 ≤ X ≤ 4) = .974, indicating that the distribution is almost entirely concentrated
between 2 and 4 months.

The 95th percentile of this distribution (i.e., the value c which separates the longest-lasting 5% of
shipping containers from the rest) is determined from

$$.95 = 1 - e^{-(c/3.5)^{10}}$$

Solving this equation gives c ≈ 3.906 months. ■
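
The quantities in Example 4.32 can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, with hypothetical function names; the percentile formula comes from solving (4.10) for x, which gives x = β[−ln(1 − p)]^(1/α).

```python
import math

alpha, beta = 10, 3.5  # Weibull parameters from Example 4.32

# Mean and variance from the gamma-function formulas
mean = beta * math.gamma(1 + 1 / alpha)
variance = beta ** 2 * (math.gamma(1 + 2 / alpha) - math.gamma(1 + 1 / alpha) ** 2)

def weibull_cdf(x, alpha, beta):
    """Weibull cdf (4.10): 0 for x < 0, 1 - exp(-(x/beta)^alpha) otherwise."""
    return 1.0 - math.exp(-((x / beta) ** alpha)) if x >= 0 else 0.0

def weibull_quantile(p, alpha, beta):
    """Solve p = 1 - exp(-(x/beta)^alpha) for x."""
    return beta * (-math.log(1 - p)) ** (1 / alpha)

p_within_3 = weibull_cdf(3, alpha, beta)
p_2_to_4 = weibull_cdf(4, alpha, beta) - weibull_cdf(2, alpha, beta)
c95 = weibull_quantile(0.95, alpha, beta)
print(round(mean, 2), round(variance, 2), round(p_within_3, 3),
      round(p_2_to_4, 3), round(c95, 3))
```

The rounded results should agree with the values in the example: 3.33, 0.16, .193, .974, and 3.906.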


Frequently, a Weibull model may be reasonable except that the smallest possible X value may be
some value γ not assumed to be zero (this would also apply to a gamma model). The quantity γ can
then be regarded as a third parameter of the distribution, which is what Weibull did in his original
work. For, say, γ = 3, all curves in Figure 4.28 would be shifted 3 units to the right. This is equivalent
to saying that X − γ has the pdf (4.9), so that the cdf of X is obtained by replacing x in (4.10) by
x − γ.


Example 4.33 An understanding of the volumetric properties of asphalt is important in designing
mixtures that will result in high-durability pavement. The article “Is a Normal Distribution the Most
Appropriate Statistical Distribution for Volumetric Properties in Asphalt Mixtures” (J. Testing Eval.,
Sept. 2009: 1-11) used the analysis of some sample data to recommend that for a particular mixture,
X = air void volume (%) be modeled with a three-parameter Weibull distribution. Suppose the values
of the parameters are γ = 4, α = 1.3, and β = .8 (quite close to estimates given in the article).

For x ≥ 4, the cumulative distribution function is

$$F(x; \alpha, \beta, \gamma) = F(x; 1.3, .8, 4) = 1 - e^{-[(x - 4)/.8]^{1.3}}$$

The probability that the air void volume of a specimen is between 5% and 6% is

$$P(5 \le X \le 6) = F(6; 1.3, .8, 4) - F(5; 1.3, .8, 4) = e^{-[(5 - 4)/.8]^{1.3}} - e^{-[(6 - 4)/.8]^{1.3}} = .263 - .037 = .226$$

Figure 4.29 shows a graph of the corresponding Weibull density function, in which the shaded area
corresponds to the probability just calculated.

Figure 4.29 Weibull density curve with threshold = 4, shape = 1.3, scale = .8 (shaded area = .226) ■
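
A short Python sketch of the three-parameter cdf used in Example 4.33 follows. The function name and argument order are assumptions made for this illustration; the only change from (4.10) is replacing x by x − γ.

```python
import math

def weibull3_cdf(x, alpha, beta, gamma):
    """Three-parameter Weibull cdf: (4.10) with x replaced by x - gamma.

    gamma is the threshold (smallest possible value of X).
    """
    if x < gamma:
        return 0.0
    return 1.0 - math.exp(-(((x - gamma) / beta) ** alpha))

# Example 4.33: threshold 4, shape 1.3, scale .8
p_5_to_6 = weibull3_cdf(6, 1.3, 0.8, 4) - weibull3_cdf(5, 1.3, 0.8, 4)
print(round(p_5_to_6, 3))
```

Setting gamma = 0 recovers the ordinary two-parameter cdf, so the same helper covers both cases.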

The Lognormal Distribution
Lognormal distributions have been used extensively in engineering, medicine, and more recently, 
finance. 


DEFINITION A nonnegative rv X is said to have a lognormal distribution if the rv Y = ln(X) has
a normal distribution. The resulting pdf of a lognormal rv when ln(X) is normally
distributed with parameters μ and σ is

$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma x} e^{-[\ln(x) - \mu]^2 / (2\sigma^2)} \qquad x > 0$$


Be careful here: the parameters μ and σ are not the mean and standard deviation of X but of ln(X). The
mean and variance of X can be shown to be

$$E(X) = e^{\mu + \sigma^2/2} \qquad V(X) = e^{2\mu + \sigma^2} \cdot \left( e^{\sigma^2} - 1 \right)$$

In Chapter 6, we will present a theoretical justification for this distribution in connection with the
Central Limit Theorem, but as with other distributions, the lognormal can be used as a model even in
the absence of such justification. Figure 4.30 illustrates graphs of the lognormal pdf; although a
normal curve is symmetric, a lognormal curve has a positive skew.


Figure 4.30 Lognormal density curves


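Because ln(X) has a normal distribution, the cdf of X can be expressed in terms of the cdf Φ(z) of a
standard normal rv Z. For x > 0,

$$\begin{aligned} F(x; \mu, \sigma) = P(X \le x) &= P[\ln(X) \le \ln(x)] = P\left[ \frac{\ln(X) - \mu}{\sigma} \le \frac{\ln(x) - \mu}{\sigma} \right] \\ &= P\left[ Z \le \frac{\ln(x) - \mu}{\sigma} \right] = \Phi\left( \frac{\ln(x) - \mu}{\sigma} \right) \end{aligned} \qquad (4.11)$$

Differentiating F(x; μ, σ) with respect to x gives the lognormal pdf f(x; μ, σ) above.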
Example 4.34 According to the article “Predictive Model for Pitting Corrosion in Buried Oil and
Gas Pipelines” (Corrosion 2009: 332-342), the lognormal distribution has been reported as the best
option for describing the distribution of maximum pit depth data from cast iron pipes in soil. The
authors suggest that a lognormal distribution with μ = .353 and σ = .754 is appropriate for maximum
pit depth (mm) of buried pipelines. For this distribution, the mean value and variance of pit depth are

$$E(X) = e^{.353 + (.754)^2/2} = e^{.6373} = 1.891$$

$$V(X) = e^{2(.353) + (.754)^2} \cdot \left( e^{(.754)^2} - 1 \right) = (3.57697)(.765645) = 2.7387$$

The probability that maximum pit depth is between 1 and 2 mm is

$$\begin{aligned} P(1 \le X \le 2) &= P(\ln(1) \le \ln(X) \le \ln(2)) \\ &= P(0 \le \ln(X) \le .693) \\ &= P\left( \frac{0 - .353}{.754} \le Z \le \frac{.693 - .353}{.754} \right) \\ &= \Phi(.45) - \Phi(-.47) = .354 \end{aligned}$$

Figure 4.31 illustrates this probability.

Figure 4.31 Lognormal density curve with μ = .353 and σ = .754 (shaded area = .354)

What value c is such that only 1% of all specimens have a maximum pit depth exceeding c? The
desired value satisfies

$$.99 = P(X \le c) = P\left( Z \le \frac{\ln(c) - .353}{.754} \right)$$

The z critical value 2.33 captures an upper-tail area of .01 ($z_{.01} = 2.33$) and thus a cumulative area of
.99. This implies that

$$\frac{\ln(c) - .353}{.754} = 2.33$$

from which ln(c) = 2.1098 and c = 8.247. Thus 8.247 mm is the 99th percentile of the maximum pit
depth distribution. ■
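
The lognormal calculations in Example 4.34 reduce to standard normal cdf evaluations, which the Python standard library can supply. The sketch below is illustrative only: the function name `lognormal_cdf` is hypothetical, and `statistics.NormalDist` (Python 3.8+) stands in for Appendix Table A.3.

```python
import math
from statistics import NormalDist

mu, sigma = 0.353, 0.754      # parameters of ln(X) from Example 4.34
std_normal = NormalDist()     # standard normal: mean 0, sd 1

def lognormal_cdf(x, mu, sigma):
    """Expression (4.11): F(x) = Phi((ln x - mu) / sigma) for x > 0."""
    if x <= 0:
        return 0.0
    return std_normal.cdf((math.log(x) - mu) / sigma)

mean = math.exp(mu + sigma ** 2 / 2)
variance = math.exp(2 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1)
p_1_to_2 = lognormal_cdf(2, mu, sigma) - lognormal_cdf(1, mu, sigma)

# 99th percentile: with the rounded table value z = 2.33, and with the unrounded z
c_table = math.exp(mu + 2.33 * sigma)
c_exact = math.exp(mu + std_normal.inv_cdf(0.99) * sigma)
print(round(mean, 3), round(variance, 4), round(p_1_to_2, 3),
      round(c_table, 3), round(c_exact, 3))
```

Using the rounded critical value 2.33 reproduces the example's 8.247; the unrounded critical value gives a slightly smaller percentile, which shows how sensitive the exponentiated result is to rounding z.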


As with the Weibull distribution, a third parameter γ can be introduced so that the domain of the
distribution is x > γ rather than x > 0.


The Beta Distribution 

All families of continuous distributions discussed so far except for the uniform distribution have 
positive density over an infinite interval (although typically the density function decreases rapidly to 
zero beyond a few standard deviations from the mean). The beta distribution provides positive density 
only for X in an interval of finite length. 


DEFINITION A random variable X is said to have a beta distribution with parameters
α, β (both positive), A, and B if the pdf of X is

$$f(x; \alpha, \beta, A, B) = \frac{1}{B - A} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \left( \frac{x - A}{B - A} \right)^{\alpha - 1} \left( \frac{B - x}{B - A} \right)^{\beta - 1} \qquad A \le x \le B$$

The case A = 0, B = 1 gives the standard beta distribution.


Figure 4.32 illustrates several standard beta pdfs. Graphs of the general pdf are similar, except they
are shifted and then stretched or compressed to fit over [A, B]. Unless α and β are integers, integration
of the pdf to calculate probabilities is difficult, so either a table of the incomplete beta function or
software is generally used.

Figure 4.32 Standard beta density curves

The standard beta distribution is commonly used to model variation in the proportion or percentage
of a quantity occurring in different samples, such as the proportion of a 24-h day that an individual is
asleep or the proportion of a certain element in a chemical compound.


The mean and variance of X are

$$\mu = A + (B - A) \cdot \frac{\alpha}{\alpha + \beta} \qquad \sigma^2 = \frac{(B - A)^2 \alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$


Example 4.35 Project managers often use a method labeled PERT—for program evaluation and 
review technique—to coordinate the various activities making up a large project. (One successful 
application was in the construction of the Apollo spacecraft.) A standard assumption in PERT analysis 
is that the time necessary to complete any particular activity once it has been started has a beta 
distribution with A = the optimistic time (if everything goes well) and B = the pessimistic time (if 
everything goes badly). Suppose that in constructing a single-family house, the time X (in days) 
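necessary for laying the foundation has a beta distribution with A = 2, B = 5, α = 2, and β = 3. Then
α/(α + β) = .4, so E(X) = 2 + (3)(.4) = 3.2. For these values of α and β, the pdf of X is a simple
polynomial function. The probability that it takes at most 3 days to lay the foundation is

$$P(X \le 3) = \int_2^3 \frac{1}{3} \cdot \frac{4!}{1!\,2!} \left(\frac{x-2}{3}\right) \left(\frac{5-x}{3}\right)^2 dx = \frac{4}{27} \int_2^3 (x-2)(5-x)^2\,dx = \frac{4}{27} \cdot \frac{11}{4} = \frac{11}{27} = .407$$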
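The PERT calculation can be checked numerically. The sketch below assumes the general beta pdf on [A, B] with the parameter values of this example (A = 2, B = 5, α = 2, β = 3) and integrates it with Simpson's rule; the helper names `beta_pdf` and `simpson` are illustrative, not part of any package.

```python
from math import gamma

def beta_pdf(x, alpha, beta, A=0.0, B=1.0):
    """General beta pdf on [A, B]; zero outside the interval."""
    if x < A or x > B:
        return 0.0
    const = gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) / (B - A)
    u = (x - A) / (B - A)              # rescale x to the unit interval
    return const * u ** (alpha - 1) * (1 - u) ** (beta - 1)

def simpson(f, lo, hi, n=1000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3

A, B, alpha, beta = 2, 5, 2, 3         # values from the foundation example
mean = A + (B - A) * alpha / (alpha + beta)
variance = (B - A) ** 2 * alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
prob = simpson(lambda x: beta_pdf(x, alpha, beta, A, B), A, 3)
print(mean, variance, round(prob, 3))
```

Because the pdf is a cubic polynomial for these parameter values, Simpson's rule is exact up to rounding error; for non-integer α and β it gives only an approximation, and values of α or β below 1 would need separate handling at the endpoints.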


Many software packages can be used to perform probability calculations for the Weibull, lognormal, 
and beta distributions. Interested readers should consult the help menus in those packages. 


Exercises: Section 4.5 (84–98)


84. The lifetime X (in hundreds of hours) of a type of vacuum tube has a Weibull distribution with
parameters α = 2 and β = 3. Compute the following:

a. E(X) and V(X)
b. P(X < 6)
c. P(1.5 < X < 6)

(This Weibull distribution is suggested as a model for time in service in “On the Assessment of
Equipment Reliability: Trading Data Collection Costs for Precision,” J. Engr. Manuf. 1991: 105-109).


85. Many U.S. railroad tracks were built using A7 steel, and there is renewed interest in the
properties of this metal. The article “Stress-State, Temperature, and Strain Rate Dependence of
Vintage ASTM A7 Steel” (J. Engr. Mater. Tech. 2019) describes, among other things, the distribution
of manganese within A7 steel specimens. The authors found that the nearest-neighbor distance (NND,
in microns) of manganese particles along longitudinal planes in A7 steel follows a Weibull
distribution with (approximate) parameter values α = 1.18 and β = 21.61.

a. What is the probability of observing an NND between 20 and 40 μm? Less than 20 μm? More than
40 μm?

b. What are the mean and standard deviation of this distribution?

c. What is the median of this distribution?


86. In recent years the Weibull distribution has been used to model engine emissions of various
pollutants. Let X denote the amount of NOx emission (g/gal) from a randomly selected four-stroke
engine of a certain type, and suppose that X has a Weibull distribution with α = 2 and β = 10
(suggested by information in the article “Quantification of Variability and Uncertainty in Lawn and
Garden Equipment NOx and Total Hydrocarbon Emission Factors,” J. Air Waste Manag. Assoc. 2002:
435-448).

a. What is the cdf of X?

b. Compute P(X < 10) and P(X > 10).

c. Determine the mean and standard deviation of X.

d. Determine the 75th percentile of this distribution.


87. Let X have a Weibull distribution with the pdf from Expression (4.10). Verify that
μ = βΓ(1 + 1/α). [Hint: In the integral for E(X), make the change of variable y = (x/β)^α, so that
x = βy^(1/α).]


88. a. In Exercise 84, what is the median lifetime of such tubes? [Hint: Use Expression (4.10).]

b. If X has a Weibull distribution with the cdf from Expression (4.10), obtain a general expression
for the (100p)th percentile of the distribution.

c. In Exercise 86, engines whose NOx emissions exceed a threshold of t g/gal must be replaced to
meet new environmental regulations. For what value of t would 10% of these engines require
replacement?


89. Let X denote the ultimate tensile strength (ksi) at −200° of a randomly selected steel specimen
of a certain type that exhibits “cold brittleness” at low temperatures. Suppose that X has a Weibull
distribution with α = 20 and β = 100.

a. What is the probability that X is at most 105 ksi?

b. If specimen after specimen is selected, what is the long-run proportion having strength values
between 100 and 105 ksi?

c. What is the median of the strength distribution?


90. The authors of the article “Study on the Life Distribution of Microdrills” (J. Engr. Manuf.
2002: 301-305) suggested that a reasonable probability model for drill lifetime was a lognormal
distribution with μ = 4.5 and σ = .8.

a. What are the mean value and standard deviation of lifetime?

b. What is the probability that lifetime is at most 100?

c. What is the probability that lifetime is at least 200? Greater than 200?


91. The article referenced in Exercise 85 also considered the distribution of areas (square microns)
of single manganese particles in through thickness planes of A7 steel. The authors determined a
lognormal distribution with parameters μ = 1.513 and σ = 1.006 to be an appropriate model for these
manganese particle areas.

a. Determine the mean and standard deviation of this distribution.

b. What is the probability of observing a particle area less than 10 square microns? Between 10 and
20 μm²?

c. Determine the probability of observing a manganese particle area less than the mean value. Why
does this probability not equal .5?


92. a. Use Equation (4.11) to write a formula for the median $\tilde{\mu}$ of the lognormal
distribution. What is the median for the area distribution of the previous exercise?

b. Recalling that $z_\alpha$ is our notation for the 100(1 − α) percentile of the standard normal
distribution, write an expression for the 100(1 − α) percentile of the lognormal distribution. In
the previous exercise, what value will particle area exceed only 5% of the time?


93. A theoretical justification based on a material failure mechanism underlies the assumption that
ductile strength X of a material has a lognormal distribution. Suppose the parameters are μ = 5 and
σ = .1.

a. Compute E(X) and V(X).

b. Compute P(X > 125).

c. Compute P(110 < X < 125).

d. What is the value of median ductile strength?

e. If ten different samples of an alloy steel of this type were subjected to a strength test, how
many would you expect to have strength of at least 125?

f. If the smallest 5% of strength values were unacceptable, what would the minimum acceptable
strength be?


94. The article “The Statistics of Phytotoxic Air Pollutants” (J. Roy. Statist Soc. 1989: 183-198)
suggests the lognormal distribution as a model for SO₂ concentration above a forest. Suppose the
parameter values are μ = 1.9 and σ = .9.

a. What are the mean value and standard deviation of concentration?

b. What is the probability that concentration is at most 10? Between 5 and 10?


95. What condition on α and β is necessary for the standard beta pdf to be symmetric?


96. Suppose the proportion X of surface area in a randomly selected quadrat that is covered by a
certain plant has a standard beta distribution with α = 5 and β = 2.

a. Compute E(X) and V(X).

b. Compute P(X < .2).

c. Compute P(.2 < X < .4).

d. What is the expected proportion of the sampling region not covered by the plant?


97. Let X have a standard beta density with parameters α and β.

a. Verify the formula for E(X) given in the section.

b. Compute E[(1 − X)^m]. If X represents the proportion of a substance consisting of a particular
ingredient, what is the expected proportion that does not consist of this ingredient?


98. Stress is applied to a 20-in. steel bar that is clamped in a fixed position at each end. Let
Y = the distance from the left end at which the bar snaps. Suppose Y/20 has a standard beta
distribution with E(Y) = 10 and V(Y) = 100/7.

a. What are the parameters of the relevant standard beta distribution?

b. Compute P(8 < Y < 12).

c. Compute the probability that the bar snaps more than 2 in. from where you expect it to snap.


4.6 Probability Plots


An investigator will often have obtained a numerical sample consisting of n observations and wish to 
know whether it is plausible that this sample came from a population distribution of some particular 
type (e.g., from a normal distribution). For one thing, many formal procedures from statistical 
inference are based on the assumption that the population distribution is of a specified type. The use 
of such a procedure is inappropriate if the actual underlying probability distribution differs greatly 
from the assumed type. Additionally, understanding the underlying distribution can sometimes give 
insight into the physical mechanisms involved in generating the data. An effective way to check a 
distributional assumption is to construct what is called a probability plot. The basis for our con- 
struction is a comparison between percentiles of the sample data and the corresponding percentiles of 
the assumed underlying distribution. 
Sample Percentiles

The details involved in constructing probability plots differ a bit from source to source. Roughly 
speaking, sample percentiles are defined in the same way that percentiles of a population distribution 
are defined. The sample 50th percentile (i.e., the sample median) should separate the smallest 50% of 
the sample from the largest 50%, the sample 90th percentile should be such that 90% of the sample 
lies below that value and 10% lies above, and so on. Unfortunately, we run into problems when we 
actually try to compute the sample percentiles for a particular sample of n observations. If, for 
example, n = 10, then we can split off 20% or 30% of the data, but there is no value that will split off 
exactly 23% of these ten observations. To proceed further, we need an operational definition of 
sample percentiles (this is one place where different people and different software packages do 
slightly different things). 

Statistical convention states that when n is odd, the sample median is the middle value in the 
ordered list of sample observations, for example, the sixth-largest value when n = 11. This amounts 
to regarding the middle observation as being half in the lower half of the data and half in the upper 
half. Similarly, suppose n = 10. Then if we call the third-smallest value the 25th percentile, we are 
regarding that value as being half in the lower group (consisting of the two smallest observations) and 
half in the upper group (the seven largest observations). This leads to the following general definition 
of sample percentiles. 


DEFINITION Order the n sample observations from smallest to largest. Then the ith-smallest
observation in the list is taken to be the sample [100(i − .5)/n]th percentile.


For example, if n = 10, the percentages corresponding to the ordered sample observations are
100(1 − .5)/10 = 5%, 100(2 − .5)/10 = 15%, 25%, ..., and 100(10 − .5)/10 = 95%. That is, the
smallest observation is the sample 5th percentile, the next-smallest value is the sample 15th
percentile, and so on. All other percentiles could then be determined by interpolation; e.g., the sample
10th percentile would then be halfway between the 5th percentile (smallest sample observation) and
the 15th percentile (second-smallest observation) of the n = 10 values. For the purposes of a
probability plot, such interpolation will not be necessary, because a probability plot will be based only on
the percentages 100(i − .5)/n corresponding to the n sample observations.
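The operational definition is easy to express in code. The sketch below computes the percentages 100(i − .5)/n and, for completeness, the interpolation described above; the ten data values are hypothetical, and the linear interpolation rule is one reasonable reading of "halfway between" rather than the only convention in use.

```python
def sample_percentages(n):
    """Percentages 100(i - .5)/n attached to the ordered observations i = 1..n."""
    return [100 * (i - 0.5) / n for i in range(1, n + 1)]

def interpolated_percentile(sorted_data, pct):
    """Linear interpolation between adjacent ordered observations.

    Defined here only for percentages between the smallest and largest
    attached percentages; that restriction is a choice of this sketch.
    """
    n = len(sorted_data)
    pos = pct / 100 * n + 0.5            # fractional 1-based rank; equals i at 100(i - .5)/n
    if pos < 1 or pos > n:
        raise ValueError("percentage outside the range covered by the sample")
    lower = int(pos)                     # rank of the observation at or just below pos
    if lower == n:
        return sorted_data[-1]
    frac = pos - lower
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([3, 7, 8, 12, 13, 14, 18, 21, 25, 30])   # hypothetical sample, n = 10
print(sample_percentages(len(data)))
print(interpolated_percentile(data, 10))                # halfway between 3 and 7
```

Different software packages adopt slightly different interpolation rules, so percentiles reported by a package need not agree exactly with this sketch.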


A Probability Plot 

We now wish to determine whether our sample data could plausibly have come from some particular 
population distribution (e.g., a normal distribution with μ = 10 and σ = 3). If the sample was actually
selected from the specified distribution, the sample percentiles (ordered sample observations) should
be reasonably close to the corresponding population distribution percentiles. That is, for i = 1, 2, ...,
n there should be reasonable agreement between the ith-smallest sample observation and the
theoretical [100(i − .5)/n]th percentile for the specified distribution. Consider the (sample percentile,
population percentile) pairs, that is, the pairs

(ith-smallest sample observation, [100(i − .5)/n]th percentile of the population distribution)


for i = 1, ..., n. Each such pair can be plotted as a point on a two-dimensional coordinate system. If
the sample percentiles are close to the corresponding population distribution percentiles, the first
number in each pair will be roughly equal to the second number, and the plotted points will then fall
close to a 45° line passing through (0, 0). Substantial deviations of the plotted points from this 45°
line suggest that the assumed distribution might be wrong.


Example 4.36 The value of a physical constant is known to an experimenter. The experimenter 
makes n = 10 independent measurements of this value using a measurement device and records the 
resulting measurement errors (error = observed value — true value). These observations appear in the 
accompanying table. 


Percentage            5        15       25       35       45
Sample observation    −1.91    −1.25    −.75     −.53     .20
z percentile          −1.645   −1.037   −.675    −.385    −.126

Percentage            55       65       75       85       95
Sample observation    .35      .72      .87      1.40     1.56
z percentile          .126     .385     .675     1.037    1.645


Is it plausible that the random variable measurement error has a standard normal distribution? The 
needed standard normal (z) percentiles are also displayed in the table and were determined as follows: 
the 5th percentile of the distribution under consideration, N(0,1), is such that ®(z) = .05. From 
software or Appendix Table A.3, the solution is roughly z = —1.645. The other nine population 
(z) percentiles were found in a similar fashion. 

Thus the points in the probability plot are (−1.91, −1.645), (−1.25, −1.037), ..., and (1.56, 1.645).
Figure 4.33 shows the resulting plot. Although the points deviate a bit from the 45° line, the pre- 
dominant impression is that this line fits the points reasonably well. The plot suggests that the 
standard normal distribution is a realistic probability model for measurement error. 
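The (observation, z percentile) pairs of this example can be generated with Python's standard library. This is a minimal sketch: the list `errors` holds the ten measurement errors from the table, and `z_percentiles` is a helper name introduced here.

```python
from statistics import NormalDist

# Measurement errors from the example, already in increasing order.
errors = [-1.91, -1.25, -0.75, -0.53, 0.20, 0.35, 0.72, 0.87, 1.40, 1.56]

def z_percentiles(n):
    """Standard normal percentiles for the probabilities (i - .5)/n, i = 1..n."""
    std_normal = NormalDist()            # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

pairs = list(zip(sorted(errors), z_percentiles(len(errors))))
for obs, z in pairs:
    # Last column: vertical gap between the point and the 45-degree line y = x.
    print(f"{obs:6.2f}  {z:7.3f}  {z - obs:7.3f}")
```

Software percentiles can differ from table-based values in the third decimal place, which has no visible effect on the plot. Passing the pairs to any plotting routine, together with the line y = x, reproduces the kind of display described for this example.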


Figure 4.33 Plots of pairs (observed value, z percentile) for the data of Example 4.36
An investigator is typically not interested in knowing whether a particular probability distribution,
such as the normal distribution with μ = 0 and σ = 1 or the exponential distribution with λ = .1, is a
plausible model for the population distribution from which the sample was selected. Instead, the 
investigator will want to know whether some member of a family of probability distributions provides 
a plausible model—the family of normal distributions, the family of exponential distributions, the 
family of Weibull distributions, and so on. The values of any parameters are usually not specified at the 
outset. If the family of Weibull distributions is under consideration as a model for lifetime data, the 
issue is whether there are any values of the parameters α and β for which the corresponding Weibull
distribution gives a good fit to the data. Fortunately, it is almost always the case that just one 
probability plot will suffice for assessing the plausibility of an entire family. If the plot deviates 
substantially from a straight line, but not necessarily the 45° line, no member of the family is plausible. 

To see why, let’s focus on a plot for checking normality. As mentioned earlier, such a plot can be 
very useful in applied work because many formal statistical procedures are appropriate (i.e., give 
accurate inferences) only when the population distribution is at least approximately normal. These 
procedures should generally not be used if a normal probability plot shows a very pronounced 
departure from linearity. The key to constructing an omnibus normal probability plot is the rela- 
tionship between standard normal (z) percentiles and those for any other normal distribution, which 
was presented in Section 4.3: 


$$N(\mu, \sigma)\ \text{percentile} = \mu + \sigma \cdot (\text{corresponding } z \text{ percentile})$$

If each sample observation was exactly equal to the corresponding N(μ, σ) percentile, then the pairs
(observation, μ + σ · [z percentile]) would fall on the 45° line, y = x. But since μ + σz is itself a linear
function, the pairs (observation, z percentile) would also fall on a straight line, just not the line with
slope 1 and y-intercept 0. (The latter pairs would fall on the line z = x/σ − μ/σ, but the equation
itself isn’t important.)


DEFINITION A plot of the n pairs 
(ith-smallest observation, [100(i — .5) /n]th z percentile) 


on a two-dimensional coordinate system is called a normal probability plot. If the 
sample observations are in fact drawn from a normal distribution, then the points 
should fall close to a straight line (although not necessarily a 45° line). Thus a plot 
for which the points fall close to some straight line suggests that the assumption of a 
normal population distribution is plausible. 


Example 4.37 The accompanying sample consisting of n = 20 observations on dielectric break- 
down voltage of a piece of epoxy resin appeared in the article “Maximum Likelihood Estimation in 
the 3-Parameter Weibull Distribution” (IEEE Trans. Dielectrics Electr. Insul. 1996: 43-55). Values 
of (i − .5)/n for which z percentiles are needed are (1 − .5)/20 = .025, (2 − .5)/20 = .075, ..., and
.975.


Observation    24.46   25.61   26.25   26.42   26.66   27.15   27.31   27.54   27.74   27.94
z percentile   −1.96   −1.44   −1.15   −.93    −.76    −.60    −.45    −.32    −.19    −.06

Observation    27.98   28.04   28.28   28.49   28.50   28.87   29.11   29.13   29.50   30.88
z percentile   .06     .19     .32     .45     .60     .76     .93     1.15    1.44    1.96
Figure 4.34 shows the resulting normal probability plot. The pattern in the plot is quite straight,
indicating it is plausible that the population distribution of dielectric breakdown voltage is normal. 
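One way to put a number on "quite straight" is to fit a line to the plotted pairs. The sketch below does so for the 20 breakdown voltages; the least-squares line and the correlation coefficient are choices made for this illustration, not a procedure prescribed by the example, and `normal_plot_summary` is a hypothetical helper name.

```python
from statistics import NormalDist, mean

voltage = [24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94,
           27.98, 28.04, 28.28, 28.49, 28.50, 28.87, 29.11, 29.13, 29.50, 30.88]

def normal_plot_summary(data):
    """Return (pairs, slope, intercept, r) for a normal probability plot.

    The line is a least-squares fit of z percentile on observation, so the
    slope and intercept roughly estimate 1/sigma and -mu/sigma.
    """
    x = sorted(data)
    n = len(x)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    xbar, zbar = mean(x), mean(z)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    szz = sum((zi - zbar) ** 2 for zi in z)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = zbar - slope * xbar
    r = sxz / (sxx * szz) ** 0.5         # correlation between observations and z percentiles
    return list(zip(x, z)), slope, intercept, r

pairs, slope, intercept, r = normal_plot_summary(voltage)
print(round(slope, 3), round(intercept, 2), round(r, 3))
```

A correlation close to 1 corresponds to a nearly linear pattern. The fitted line is not the 45° line here, since the data are not standardized; its slope is roughly the reciprocal of the sample standard deviation.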
Figure 4.34 Normal probability plot for the dielectric breakdown voltage sample


There is an alternative version of a normal probability plot in which the z percentile axis is 
replaced by a nonlinear probability axis. The scaling on this axis is constructed so that plotted points 
should again fall close to a line when the sampled distribution is normal. Figure 4.35 shows such a 
plot from Minitab for the breakdown voltage data of Example 4.37. Here the z values are replaced by 
the corresponding normal percentiles. The plot remains the same, and it is just the labeling of the axis 
that changes. Minitab and various other software packages use the refinement (i — .375)/(n + .25) of 
the expression (i — .5)/n in order to get a better approximation to what is expected for the ordered 
values from the standard normal distribution. 
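The effect of the refinement is easiest to see by computing both sets of plotting positions side by side. In this sketch the function `plotting_positions` and its parameters `a` and `b` are names introduced for illustration; n = 20 matches the breakdown voltage sample.

```python
from statistics import NormalDist

def plotting_positions(n, a=0.5, b=0.0):
    """Probabilities (i - a)/(n + b) for i = 1..n.

    a=0.5, b=0 gives (i - .5)/n; a=0.375, b=0.25 gives (i - .375)/(n + .25).
    """
    return [(i - a) / (n + b) for i in range(1, n + 1)]

n = 20
basic = plotting_positions(n)
refined = plotting_positions(n, a=0.375, b=0.25)
z_basic = [NormalDist().inv_cdf(p) for p in basic]
z_refined = [NormalDist().inv_cdf(p) for p in refined]

for i in (0, 1, n - 2, n - 1):           # the two smallest and two largest positions
    print(i + 1, round(basic[i], 4), round(refined[i], 4),
          round(z_basic[i], 3), round(z_refined[i], 3))
```

The two conventions differ mainly at the extremes, where the refined positions give z values slightly closer to zero; in the middle of the sample the difference is negligible.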
Figure 4.35 Normal probability plot of the breakdown voltage data from Minitab


Departures from Normality

A nonnormal population distribution can often be placed in one of the following three categories: 


1. It is symmetric and has “lighter tails” than does a normal distribution; that is, the density curve 
declines more rapidly out in the tails than does a normal curve. 

2. It is symmetric and heavy-tailed compared to a normal distribution. 

3. It is skewed. 


A uniform distribution is light-tailed, since its density function drops to zero outside a finite
interval. The density function f(x) = 1/[π(1 + x²)], for −∞ < x < ∞, is one example of a heavy-tailed
distribution, since 1/(1 + x²) declines much less rapidly than does $e^{-x^2/2}$. Lognormal and Weibull
distributions are among those that are skewed. When the points in a normal probability plot do not
adhere to a straight line, the pattern will frequently suggest that the population distribution is in a
particular one of these three categories.
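A direct comparison of density heights shows what "lighter" and "heavier" tails mean in practice. The sketch below evaluates a uniform density, the standard normal density, and the heavy-tailed density 1/[π(1 + x²)] at several points; scaling the uniform to have variance 1 is an assumption made here only so that the comparison is on a common footing.

```python
from math import exp, pi, sqrt

def normal_pdf(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def heavy_tail_pdf(x):
    """f(x) = 1/[pi(1 + x^2)], the heavy-tailed density cited in the text."""
    return 1 / (pi * (1 + x * x))

def uniform_pdf(x, half_width=sqrt(3)):
    """Uniform density on [-half_width, half_width]; the default gives variance 1."""
    return 1 / (2 * half_width) if abs(x) <= half_width else 0.0

for x in (0, 1, 2, 3, 4, 5):
    print(x, round(uniform_pdf(x), 5), round(normal_pdf(x), 5), round(heavy_tail_pdf(x), 5))
```

Beyond about two standard units the uniform density is exactly zero and the normal density is already tiny, whereas the heavy-tailed density remains orders of magnitude larger than the normal one.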

Figure 4.36 illustrates typical normal probability plots corresponding to the three situations
above. If the sample was selected from a light-tailed distribution, the largest and smallest observations
are usually not as extreme as would be expected from a normal random sample. Visualize a straight
line drawn through the middle part of the plot; points on the far right tend to be above the line
(z percentile > observed value), whereas points on the left end of the plot tend to fall below the
straight line (z percentile < observed value). The result is an S-shaped pattern of the type pictured in
Figure 4.36a. For sample observations from a heavy-tailed distribution, the opposite effect will occur,
and a normal probability plot will have an S shape with the opposite orientation, as in Figure 4.36b. If
the underlying distribution is positively skewed (a short left tail and a long right tail), the smallest
sample observations will be larger than expected from a normal sample and so will the largest
observations. In this case, points on both ends of the plot will fall below a straight line through the
middle part, yielding a curved pattern, as illustrated in Figure 4.36c. For example, a sample from a
lognormal distribution will usually produce such a pattern; a plot of (ln(observation), z percentile)
pairs should then resemble a straight line.

Figure 4.36 Probability plots that suggest a nonnormal distribution:
(a) a plot consistent with a light-tailed distribution; (b) a plot consistent with
a heavy-tailed distribution; (c) a plot consistent with a (positively) skewed distribution

Even when the population distribution is normal, the sample percentiles will not coincide exactly 
with the theoretical percentiles because of sampling variability. How much can the points in the 
probability plot deviate from a straight line pattern before the assumption of population normality is 
no longer plausible? This is not an easy question to answer. Generally speaking, a small sample from 
a normal distribution is more likely to yield a plot with a nonlinear pattern than is a large sample. The 
book Fitting Equations to Data (see the bibliography) presents the results of a simulation study in 
which numerous samples of different sizes were selected from normal distributions. The authors 
concluded that there is typically greater variation in the appearance of the probability plot for sample 
sizes smaller than 30, and only for much larger sample sizes does a linear pattern generally pre- 
dominate. When a plot is based on a small sample size, only a very substantial departure from 
linearity should be taken as conclusive evidence of nonnormality. A similar comment applies to 
probability plots for checking the plausibility of other types of distributions. 


Beyond Normality 

Consider a family of probability distributions involving two parameters, θ₁ and θ₂, and let F(x; θ₁, θ₂)
denote the corresponding cdf. The family of normal distributions is one such family, with θ₁ = μ,
θ₂ = σ, and F(x; μ, σ) = Φ[(x − μ)/σ]. Another example is the Weibull family, with θ₁ = α, θ₂ = β,
and

$$F(x; \alpha, \beta) = 1 - e^{-(x/\beta)^{\alpha}}$$


Still another family of this type is the gamma family, for which the cdf is an integral involving the 
incomplete gamma function that cannot be expressed in any simpler form. 

The parameters θ₁ and θ₂ are said to be location and scale parameters, respectively, if F(x; θ₁, θ₂)
is a function of (x − θ₁)/θ₂. The parameters μ and σ of the normal family are location and scale
parameters, respectively. Changing μ shifts the location of the bell-shaped density curve to the right
or left, and changing σ amounts to stretching or compressing the measurement scale (the scale on the
horizontal axis when the density function is graphed). Another example is given by the cdf

$$F(x; \theta_1, \theta_2) = 1 - e^{-e^{(x - \theta_1)/\theta_2}} \qquad -\infty < x < \infty$$


A random variable with this cdf is said to have an extreme value distribution. It is used in applications 
involving component lifetime and material strength. 
Although the form of the extreme value cdf might at first glance suggest that θ₁ is the point of
symmetry for the density function, and therefore the mean and median, this is not the case. Instead,
P(X ≤ θ₁) = F(θ₁; θ₁, θ₂) = 1 − e⁻¹ = .632, and the density function f(x; θ₁, θ₂) = F′(x; θ₁, θ₂) is
negatively skewed (a long lower tail). Similarly, the scale parameter θ₂ is not the standard deviation
(μ = θ₁ − .5772θ₂ and σ = 1.283θ₂). However, changing the value of θ₁ does change the location of
the density curve, whereas a change in θ₂ rescales the measurement axis.
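These facts about the extreme value distribution can be verified numerically. The sketch below assumes the cdf F(x; θ₁, θ₂) = 1 − exp(−exp((x − θ₁)/θ₂)) and approximates the mean and standard deviation with a midpoint-rule sum; the parameter values θ₁ = 10 and θ₂ = 2 are hypothetical, and the integration limits are chosen so that the neglected tail area is negligible.

```python
from math import exp, sqrt

def ev_cdf(x, theta1=0.0, theta2=1.0):
    """Extreme value cdf F(x) = 1 - exp(-exp((x - theta1)/theta2))."""
    return 1 - exp(-exp((x - theta1) / theta2))

def ev_pdf(x, theta1=0.0, theta2=1.0):
    """Derivative of ev_cdf with respect to x."""
    w = (x - theta1) / theta2
    return exp(w - exp(w)) / theta2

def moments(theta1, theta2, lo=-40.0, hi=8.0, steps=48000):
    """Mean and standard deviation via a midpoint-rule sum over standardized w in [lo, hi]."""
    h = (hi - lo) / steps
    m1 = m2 = 0.0
    for k in range(steps):
        w = lo + (k + 0.5) * h
        x = theta1 + theta2 * w
        weight = ev_pdf(w) * h           # standardized density times step width
        m1 += x * weight
        m2 += x * x * weight
    return m1, sqrt(m2 - m1 * m1)

theta1, theta2 = 10.0, 2.0               # hypothetical parameter values
mu, sigma = moments(theta1, theta2)
print(round(ev_cdf(theta1, theta1, theta2), 3))
print(round(mu, 3), round(theta1 - 0.5772 * theta2, 3))
print(round(sigma, 3), round(1.283 * theta2, 3))
```

The mean falls below θ₁, consistent with the long lower tail, and the standard deviation is somewhat larger than θ₂.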

The parameter β of the Weibull distribution is a scale parameter. However, α is not a location
parameter but instead is called a shape parameter. The same is true for the parameters α and β of the
gamma distribution. In the usual form, the density function for any member of either the gamma or
Weibull distribution is positive for x > 0 and zero otherwise. A location (or shift) parameter can be
introduced as a third parameter (we noted this for the Weibull distribution in Section 4.5) to shift the
density function so that it is positive if x > γ and zero otherwise.

When the family under consideration has only location and scale parameters, the issue of whether 
any family member is a plausible population distribution can be addressed by a single probability plot. 
This is exactly what we did to obtain an omnibus normal probability plot. One first obtains the 
percentiles of the standard distribution, the one with θ₁ = 0 and θ₂ = 1, for percentages 100(i − .5)/n
(i= 1, ..., n). The n (observation, standardized percentile) pairs give the points in the plot. 

Somewhat surprisingly, this methodology can be applied to yield an omnibus Weibull probability
plot. The key result is that if X has a Weibull distribution with shape parameter α and scale parameter
β, then the transformed variable ln(X) has an extreme value distribution with location parameter
θ₁ = ln(β) and scale parameter θ₂ = 1/α (see Exercise 154). Thus a plot of the (ln(observation),
extreme value standardized percentile) pairs that shows a strong linear pattern provides support for
choosing the Weibull distribution as a population model.
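The key result follows in a few steps from the Weibull cdf $F(x; \alpha, \beta) = 1 - e^{-(x/\beta)^{\alpha}}$. The derivation below is a sketch in which Y = ln(X) is the only symbol not already used in the text.

```latex
\begin{align*}
P(Y \le y) &= P(\ln X \le y) = P(X \le e^{y}) \\
           &= 1 - \exp\!\left[-\left(\frac{e^{y}}{\beta}\right)^{\alpha}\right] \\
           &= 1 - \exp\!\left[-e^{\alpha\,(y - \ln \beta)}\right] \\
           &= 1 - \exp\!\left[-e^{(y - \theta_1)/\theta_2}\right],
           \qquad \theta_1 = \ln(\beta),\quad \theta_2 = 1/\alpha .
\end{align*}
```

The last line is the extreme value cdf evaluated at y, so the cdf of ln(X) depends on α and β only through a location parameter and a scale parameter.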


Example 4.38 As climate change continues, more areas experience extreme wind events, which 
both safety engineers and FEMA must accurately model because they affect home damage. Engineers 
frequently use the Weibull distribution to model maximum wind speed in a given region. The article 
“Estimation of Extreme Wind Speeds by Using Mixed Distributions” (Engr. Invest. Technol. 2013: 
153-162) provides measurements of X = maximum wind speed (m/s) for 45 stations in the 
Netherlands. A Weibull probability plot can be constructed by plotting the logarithms of those 
observations against the (100p)th percentiles of the standard extreme value distribution for
p = (1 − .5)/45, (2 − .5)/45, ..., (45 − .5)/45. The (100p)th percentile η(p) satisfies

$$p = F(\eta(p)) = 1 - e^{-e^{\eta(p)}}$$

from which η(p) = ln[−ln(1 − p)].


Percentile    x      ln(x)      Percentile    x      ln(x)
−4.49         17.7   2.87       −0.30         25.8   3.25
−3.38         18.9   2.94       −0.24         25.8   3.25
−2.86         20.9   3.04       −0.18         25.9   3.25
−2.51         21.4   3.06       −0.12         25.9   3.25
−2.25         21.7   3.08       −0.06         26.0   3.26
−2.04         22.3   3.10       0.00          26.2   3.27
−1.86         22.6   3.12       0.06          26.2   3.27
−1.70         22.8   3.13       0.12          26.4   3.27
−1.56         23.0   3.14       0.19          26.6   3.28
−1.44         23.4   3.14       0.25          26.7   3.28
−1.33         23.2   3.14       0.31          26.8   3.29
−1.22         23.3   3.15       0.38          26.9   3.29
−1.12         23.1   3.17       0.44          26.9   3.29
−1.03         24.0   3.18       0.51          27.0   3.30
−0.94         24.1   3.18       0.58          27.3   3.31
−0.86         24.1   3.18       0.66          28.0   3.33
−0.78         24.2   3.19       0.74          28.1   3.34
−0.71         24.4   3.19       0.83          28.8   3.36
−0.64         25.2   3.23       0.94          29.2   3.37
−0.57         25.6   3.24       1.06          29.4   3.38
−0.50         25.6   3.24       1.22          30.0   3.40
−0.43         25.7   3.25       1.50          31.1   3.44
−0.37         25.7   3.25


The pairs (2.87, −4.49), (2.94, −3.38), ..., (3.44, 1.50) are plotted as points in Figure 4.37. The
straightness of the plot argues strongly that ln(X) is compatible with an extreme value distribution,
and so X itself can be well-modeled by a Weibull distribution.
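The percentile column of the table comes straight from the formula η(p) = ln[−ln(1 − p)]. The following sketch computes those values for n = 45 and shows how the plotted pairs would be assembled for any sample; the function names are illustrative, and the two-value data set in the last lines is hypothetical.

```python
from math import log

def extreme_value_percentile(p):
    """(100p)th percentile of the standard extreme value distribution."""
    return log(-log(1 - p))

def weibull_plot_pairs(data):
    """(ln(observation), extreme value percentile) pairs for a Weibull probability plot."""
    x = sorted(data)
    n = len(x)
    return [(log(xi), extreme_value_percentile((i - 0.5) / n))
            for i, xi in enumerate(x, start=1)]

n = 45
eta = [extreme_value_percentile((i - 0.5) / n) for i in range(1, n + 1)]
print([round(e, 2) for e in eta[:3]], round(eta[-1], 2))

# Hypothetical two-observation sample, just to show the shape of the output.
print(weibull_plot_pairs([2.0, 1.0]))
```

Note that the percentiles are not symmetric about zero: the smallest is far more extreme than the largest, reflecting the long lower tail of the extreme value distribution.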


Figure 4.37 A Weibull probability plot of the maximum wind speed data


It should be noted that many statistical software packages have built-in Weibull probability plot
functionality that does not require the user to transform the data or calculate the extreme value
percentiles.


The gamma distribution is an example of a family involving a shape parameter for which there is 
no transformation into a distribution that depends only on location and scale parameters. Construction 
of a probability plot necessitates first estimating the shape parameter from sample data (some methods 
for doing this are described in Chapter 7). 

Sometimes an investigator wishes to know whether the transformed variable X^θ has a normal
distribution for some value of θ (by convention, θ = 0 is identified with the logarithmic
transformation, in which case X has a lognormal distribution). The book Graphical Methods for Data
Analysis (see the bibliography) discusses this type of problem as well as other refinements of
probability plotting.
Formal Tests of a Distributional Fit

Given the limitations of probability plots, there is a need for an alternative. Statisticians have developed
several formal procedures for assessing whether sample data could plausibly have come from a
normally distributed population. The Ryan-Joiner test quantifies on a zero-to-one scale how closely
the pattern of points in a normal probability plot adheres to a straight line, with higher values
corresponding to a more linear pattern. If this quantified value is too low, the test casts doubt on
population normality. (In the formal language of Chapter 9, the test “rejects” the claim of a normal
population if the probability plot is sufficiently nonlinear.) The Ryan-Joiner measure appears in the
top-right corner of Figure 4.35 (RJ = 0.988); its very high value on a [0, 1] scale implies that
population normality is plausible. The Shapiro-Wilk test proceeds similarly, although it quantifies
linearity somewhat differently, and is more widely available in statistical software packages: R, SAS,
Stata, SPSS, and JMP all include the Shapiro-Wilk test among their options.

The Ryan-Joiner and Shapiro-Wilk tests are specialized to assessing normality; i.e., they are not
designed to detect conformance with other distributions (gamma, Weibull, etc.). The
Anderson-Darling (AD) test and the Kolmogorov-Smirnov (KS) test can both be applied to a wider collection of
distributions. Each of these latter tests is based on comparing the cdf F(x) of the theorized distribution
(e.g., the Weibull cdf) to the “empirical” cdf $F_n(x)$ of the sample data, defined for any real
number x by

$$F_n(x) = \text{the proportion of the sample values } \{x_1, \ldots, x_n\} \text{ that are} \le x$$

If F(x) and $F_n(x)$ are “too far apart” in some sense, this indicates that the sample data is
incompatible with the theorized population distribution (and so that theory should be “rejected”). The AD
and KS tests differ in how they quantify the disparity between F(x) and $F_n(x)$. (Specific to assessing
normality, a 2011 article in the Journal of Statistical Modeling and Analysis found that the
Shapiro-Wilk test has greater capability of detecting normality violations than either the AD or KS tests.)
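The empirical cdf and one common measure of disparity can be sketched as follows. The code computes the proportion of sample values at or below x and the largest vertical gap between the empirical cdf and a theorized cdf, which is the quantity on which the KS test is based; it does not compute critical values or P-values. The data are the ten measurement errors of Example 4.36, compared against the standard normal cdf, and both function names are illustrative.

```python
from statistics import NormalDist

def empirical_cdf(sample):
    """Return F_n, where F_n(x) is the proportion of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    def F_n(x):
        return sum(1 for v in ordered if v <= x) / n
    return F_n

def max_cdf_gap(sample, cdf):
    """Largest vertical distance between F_n and a theorized continuous cdf.

    Because F_n jumps at each observation, the gap is checked both just
    before and at every ordered value.
    """
    ordered = sorted(sample)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered, start=1):
        F = cdf(x)
        gap = max(gap, i / n - F, F - (i - 1) / n)
    return gap

errors = [-1.91, -1.25, -0.75, -0.53, 0.20, 0.35, 0.72, 0.87, 1.40, 1.56]
F_n = empirical_cdf(errors)
D = max_cdf_gap(errors, NormalDist().cdf)
print(F_n(0), round(D, 3))
```

Whether a given gap is "too far apart" depends on the sample size, which is why a formal test compares the gap to a critical value rather than judging it on its own.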


Exercises: Section 4.6 (99-109) 


99. The accompanying normal probability plot was constructed from a sample of 30 readings on tension
for mesh screens behind the surface of video display tubes. Does it appear plausible that the tension
distribution is normal? Explain.

[Normal probability plot for Exercise 99; horizontal axis marked at 200, 250, 300, 350]

100. A sample of 15 female collegiate golfers was selected and the clubhead velocity (km/h) while
swinging a driver was determined for each one, resulting in the following data (“Hip Rotational
Velocities during the Full Golf Swing,” J. Sports Sci. Med. 2009: 296-299):

69.0 69.7 72.7 80.3 81.0
85.0 86.0 86.3 86.7 87.7
89.3 90.7 91.0 92.5 93.0

The corresponding z percentiles are

-1.83 -1.28 -0.97 -0.73 -0.52
-0.34 -0.17  0.0   0.17  0.34
 0.52  0.73  0.97  1.28  1.83

Construct a normal probability plot and a dotplot. Is it plausible that the population
distribution is normal?

101. Construct a normal probability plot for the following sample of observations on coating
thickness for low-viscosity paint (“Achieving a Target Value for a Manufacturing Process: A Case
Study,” J. Qual. Tech. 1992: 22-26). Would you feel comfortable estimating population mean thickness
using a method that assumed a normal population distribution? Explain.

 .83  .88  .88 1.04 1.09 1.12
1.29 1.31 1.48 1.49 1.59 1.62
1.65 1.71 1.76 1.83

102. The article “A Probabilistic Model of Fracture in Concrete and Size Effects on Fracture
Toughness” (Mag. Concrete Res. 1996: 311-320) gives arguments for why fracture toughness in concrete
specimens should have a Weibull distribution and presents several histograms of data that appear
well fit by superimposed Weibull curves. Consider the following sample of n = 18 observations on
toughness for high-strength concrete (consistent with one of the histograms); values of
$p_i = (i - .5)/18$ are also given.

Observation    .47    .58    .65    .69    .72    .74
p_i           .0278  .0833  .1389  .1944  .2500  .3056

Observation    .77    .79    .80    .81    .82    .84
p_i           .3611  .4167  .4722  .5278  .5833  .6389

Observation    .86    .89    .91    .95   1.01   1.04
p_i           .6944  .7500  .8056  .8611  .9167  .9722

Construct a Weibull probability plot and comment.

103. Construct a normal probability plot for the escape time data given in Exercise 46 of
Chapter 1. Does it appear plausible that escape time has a normal distribution? Explain.

104. The article “Reducing Uncertainty of Design Floods of Two-Component Mixture Distributions by
Utilizing Flood Timescale to Classify Flood Types in Seasonally Snow Covered Region” (J. Hydrol.
2019: 588-608) reports the accompanying data on annual precipitation (mm/yr) at 34 watersheds in
Norway.

 527.9  598.2  668.5  697.7  859.0  884.3  894.3 1030.7 1035.5
1136.6 1160.1 1177.0 1182.3 1195.6 1212.8 1294.2 1441.7 1475.4
1512.7 1542.5 1642.6 1872.1 1976.3 2082.9 2266.3 2337.0 2365.0
2383.8 2628.5 2671.5 2872.3 3221.6 3430.2 4029.7

a. Construct a normal probability plot. Is normality plausible?

b. Construct a Weibull probability plot. Is the Weibull distribution family plausible?


105. Construct a probability plot that will allow you to assess the plausibility of the lognormal
distribution as a model for the nitrogen data of Example 1.17.

106. The accompanying observations are precipitation values during March over a 30-year period in
Minneapolis-St. Paul.

0.77 1.20 3.00 1.62 2.81 2.48
1.74 0.47 3.09 1.31 1.87 0.96
0.81 1.43 1.51 0.32 1.18 1.89
1.20 3.37 2.10 0.59 1.35 0.90
1.95 2.20 0.52 0.81 4.75 2.05

a. Construct and interpret a normal probability plot for this data set.

b. Calculate the square root of each value and then construct a normal probability plot based on
this transformed data. Does it seem plausible that the square root of precipitation is normally
distributed?

c. Repeat part (b) after transforming by cube roots.

107. The accompanying data set consists of observations on shower-flow rate (L/min) for a sample of
n = 129 houses in Perth, Australia (“An Application of Bayes Methodology to the Analysis of Diary
Records in a Water Use Study,” J. Amer. Statist. Assoc. 1987: 705-711):

 4.6 12.3  7.1  7.0  4.0  9.2  6.7  6.9 11.5  5.1
11.2 10.5 14.3  8.0  8.8  6.4  5.1  5.6  9.6  7.5
[row illegible in source]
 8.3  6.5  7.6  9.3  9.2  7.3  5.0  6.3 13.8  6.2
 5.4  4.8  7.5  6.0  6.9 10.8  7.5  6.6  5.0  3.3
 7.6  3.9 11.9  2.2 15.0  7.2  6.1 15.3 18.9  7.2
 5.4  5.5  4.3  9.0 12.7 11.3  7.4  5.0  3.5  8.2
[rows illegible in source]
10.8 15.5  7.5  6.4  3.4  5.5  6.6  5.9 15.0  9.6
 7.8  7.0  6.9  4.1  3.6 11.9  3.7  5.7  6.8 11.3
 9.3  9.6 10.4  9.3  6.9  9.8  9.1 10.6  4.5  6.2
[row illegible in source]

Construct a normal probability plot of this data and comment.

108. Let the ordered sample observations be denoted by $y_1, y_2, \ldots, y_n$ ($y_1$ being the
smallest and $y_n$ the largest). Our suggested check for normality is to plot the
$(\Phi^{-1}[(i - .5)/n], y_i)$ pairs. Suppose we believe that the observations come from a
distribution with mean 0, and let $w_1, \ldots, w_n$ be the ordered absolute values of the $x_i$’s.
A half-normal plot is a probability plot of the $w_i$’s. That is, since
$P(|Z| \le w) = P(-w \le Z \le w) = 2\Phi(w) - 1$, a half-normal plot is a plot of the
$(\Phi^{-1}[(p_i + 1)/2], w_i)$ pairs, where $p_i = (i - .5)/n$. The virtue of this plot is that small
or large outliers in the original sample will now appear only at the upper end of the plot rather
than at both ends. Construct a half-normal plot for the following sample of measurement errors, and
comment: -3.78, -1.27, 1.44, -.39, 12.38, -43.40, 1.15, -3.96, -2.34, 30.84.

109. The following failure time observations (1000s of hours) resulted from accelerated life
testing of 16 integrated circuit chips of a certain type:

82.8 11.6 359.5 502.5 307.8 179.7
[remaining observations illegible in source]

Use the corresponding percentiles of the exponential distribution with λ = 1 to construct a
probability plot. Then explain why the plot assesses the plausibility of the sample having been
generated from any exponential distribution.


4.7. Transformations of a Random Variable 


Often we need to deal with a transformation Y = g(X) of the random variable X. For example, 
g(X) could be a simple change of time scale: if X is the time to complete a task in minutes, then 
Y = 60X is the completion time expressed in seconds. How can we get the pdf of Y from the pdf of X? 
Consider first a simple example. 


Example 4.39 The interval X in minutes between calls to a 911 center is exponentially distributed
with mean 2 min, so its pdf is $f_X(x) = .5e^{-.5x}$ for x > 0. In order to get the pdf of Y = 60X, we first
obtain its cdf:

$$F_Y(y) = P(Y \le y) = P(60X \le y) = P(X \le y/60) = F_X(y/60) = \int_0^{y/60} .5e^{-.5x}\,dx = 1 - e^{-y/120}$$

Differentiating this with respect to y gives $f_Y(y) = (1/120)e^{-y/120}$ for y > 0. We see that the distribution
of Y is exponential with mean 120 s (2 min).
There is nothing special here about the mean 2 and the multiplier 60. It should be clear that if we
multiply an exponential random variable with mean μ by a positive constant c we get another
exponential random variable with mean cμ. ■


Sometimes it isn’t possible to evaluate the cdf in closed form. Could the pdf of Y be obtained
without evaluating the integral? Yes, thanks to the following theorem.

TRANSFORMATION THEOREM  Let X have pdf $f_X(x)$ and let Y = g(X), where g is monotonic (either
strictly increasing or strictly decreasing) on the set of all possible values of X, so it
has an inverse function $X = g^{-1}(Y) = h(Y)$. Assume that h has a derivative $h'(y)$. Then

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| \qquad (4.12)$$


Proof Here is the proof assuming that g is monotonically increasing. The proof for g monotonically
decreasing is similar. First find the cdf of Y:

$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le h(y)) = F_X(h(y))$$

The third equality above, wherein $g(X) \le y$ is true iff $X \le g^{-1}(y) = h(y)$, relies on g being a
monotonically increasing function. Now differentiate the cdf with respect to y, using the Chain Rule:

$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{d}{dy}F_X(h(y)) = F_X'(h(y)) \cdot h'(y) = f_X(h(y)) \cdot h'(y)$$

The absolute value on the derivative in (4.12) is needed only in the other case where g is decreasing.
The set of possible values for Y is obtained by applying g to the set of possible values for X. ■


Example 4.40 Let’s apply the Transformation Theorem to the situation introduced in Example 4.39.
There Y = g(X) = 60X and X = h(Y) = Y/60.

$$f_Y(y) = f_X[h(y)]\,|h'(y)| = .5e^{-.5(y/60)} \cdot \frac{1}{60} = \frac{1}{120}e^{-y/120} \qquad y > 0$$

This matches the pdf of Y derived through the cdf in Example 4.39. ■
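
The calculation in Example 4.40 can also be checked numerically. The sketch below codes formula (4.12) directly for the exponential pdf with mean 2 and the inverse transformation h(y) = y/60, then compares the result with R's built-in exponential density dexp at a few points. The function names fX, h, dh, and fY and the evaluation points are arbitrary choices made for this illustration.

```r
# pdf of X: exponential with mean 2 (rate .5), as in Example 4.39
fX <- function(x) ifelse(x > 0, 0.5 * exp(-0.5 * x), 0)

# Inverse transformation h(y) = y/60 and its derivative h'(y) = 1/60
h  <- function(y) y / 60
dh <- function(y) rep(1 / 60, length(y))

# Transformation Theorem (4.12): fY(y) = fX(h(y)) * |h'(y)|
fY <- function(y) fX(h(y)) * abs(dh(y))

# Compare with the exponential density having mean 120 (rate 1/120)
y <- c(10, 60, 120, 300)        # arbitrary evaluation points, in seconds
cbind(theorem = fY(y), dexp = dexp(y, rate = 1 / 120))
```

The two columns agree, consistent with Y being exponential with mean 120 s.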


Example 4.41 Let X ~ Unif[0, 1], so $f_X(x) = 1$ for 0 ≤ x ≤ 1, and define a new variable
$Y = 2\sqrt{X}$. The function $g(x) = 2\sqrt{x}$ is monotone on [0, 1], with inverse $x = h(y) = y^2/4$.
Apply the Transformation Theorem:

$$f_Y(y) = f_X(h(y))\,|h'(y)| = (1)\left(\frac{2y}{4}\right) = \frac{y}{2} \qquad 0 \le y \le 2$$

The range 0 ≤ y ≤ 2 comes from the fact that $y = 2\sqrt{x}$ maps [0, 1] to [0, 2]. A graphical
representation may help in understanding why the transform $Y = 2\sqrt{X}$ yields $f_Y(y) = y/2$ if
X ~ Unif[0, 1]. Figure 4.38a shows the uniform distribution with [0, 1] partitioned into ten
subintervals. In Figure 4.38b the endpoints of these intervals are shown after transforming according
to $y = 2\sqrt{x}$. The heights of the rectangles are arranged so each rectangle still has area .1, and
therefore the probability in each interval is preserved. Notice the close fit of the dashed line,
which has the equation $f_Y(y) = y/2$.

Figure 4.38 The effect on the pdf if X is uniform on [0, 1] and $Y = 2\sqrt{X}$ ■


Example 4.42 The variation in a certain electrical current source X (in milliamps) can be modeled
by the pdf

$$f_X(x) = 1.25 - .25x \qquad 2 \le x \le 4$$

If this current passes through a 220-Ω resistor, the resulting power Y (in microwatts) is given by the
expression $Y = 220X^2$. The function $y = g(x) = 220x^2$ is monotonically increasing on the range of X,
the interval [2, 4], and has inverse function $x = h(y) = g^{-1}(y) = \sqrt{y/220}$. (Notice that g(x) is a
parabola and thus not monotone on the entire real number line, but for the purposes of the
Transformation Theorem g(x) only needs to be monotone on the range of the rv X.) Apply (4.12):

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| = f_X\left(\sqrt{y/220}\right) \cdot \left|\frac{d}{dy}\sqrt{y/220}\right|$$
$$= \left(1.25 - .25\sqrt{y/220}\right) \cdot \frac{1}{2\sqrt{220y}} = \frac{5}{8\sqrt{220y}} - \frac{1}{1760}$$

The set of possible Y values is determined by substituting x = 2 and x = 4 into $g(x) = 220x^2$; the
resulting range for Y is [880, 3520]. Therefore, the pdf of $Y = 220X^2$ is

$$f_Y(y) = \begin{cases} \dfrac{5}{8\sqrt{220y}} - \dfrac{1}{1760} & 880 \le y \le 3520 \\ 0 & \text{otherwise} \end{cases}$$

The pdfs of X and Y appear in Figure 4.39.

Figure 4.39 pdfs from Example 4.42: (a) pdf of X; (b) pdf of Y ■
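
As a sanity check on Example 4.42, the derived pdf of Y can be examined numerically. The sketch below, which assumes nothing beyond the two pdfs stated in the example, uses R's integrate function to confirm that the pdf of Y has total area 1 and that P(Y ≤ 1760) agrees with the equivalent probability for X, namely P(X ≤ √(1760/220)) = P(X ≤ √8). The cutoff 1760 is an arbitrary choice.

```r
# pdf of X and the derived pdf of Y = 220 X^2 from Example 4.42
fX <- function(x) ifelse(x >= 2 & x <= 4, 1.25 - 0.25 * x, 0)
fY <- function(y) ifelse(y >= 880 & y <= 3520,
                         5 / (8 * sqrt(220 * y)) - 1 / 1760, 0)

# 1. A legitimate pdf must integrate to 1 over its support
total <- integrate(fY, 880, 3520)$value

# 2. The events {Y <= 1760} and {X <= sqrt(8)} are the same event,
#    so the two probabilities must match
pY <- integrate(fY, 880, 1760)$value
pX <- integrate(fX, 2, sqrt(8))$value

c(total = total, pY = pY, pX = pX)
```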


The Transformation Theorem requires a monotonic transformation, but there are important 
applications in which the transformation is not monotone. Nevertheless, it may be possible to use the 
theorem anyway with a little trickery. 


Example 4.43 In this example, we start with a standard normal random variable Z, and we transform
to $Y = Z^2$. (The squares of normal random variables are important because the sample variance is built
from squares, and we will subsequently need the distribution of the sample variance.) This is not
monotonic over the interval for Z, (−∞, ∞). However, consider the transformation U = |Z|. Because
Z has a symmetric distribution, the pdf of U is $f_U(u) = f_Z(u) + f_Z(-u) = 2f_Z(u)$. Don’t despair if this is
not intuitively clear, because we’ll verify it shortly. For the time being, assume it to be true. Then
$Y = Z^2 = |Z|^2 = U^2$, and the transformation in terms of U is monotonic because its set of possible
values is (0, ∞). Thus we can use the Transformation Theorem with $h(y) = \sqrt{y}$:

$$f_Y(y) = f_U[h(y)]\,|h'(y)| = 2f_Z[h(y)]\,|h'(y)| = \frac{2}{\sqrt{2\pi}}e^{-.5(\sqrt{y})^2} \cdot \frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}e^{-y/2} \qquad y > 0$$

You were asked to believe intuitively that $f_U(u) = 2f_Z(u)$. Here is a little derivation that works as long
as the distribution of Z is symmetric about 0. If u > 0,

$$F_U(u) = P(U \le u) = P(|Z| \le u) = P(-u \le Z \le u) = 2P(0 \le Z \le u) = 2[F_Z(u) - F_Z(0)]$$

Differentiating this with respect to u gives $f_U(u) = 2f_Z(u)$. ■
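
The pdf of Z² obtained in Example 4.43 can be checked in R. As a hedged aside that goes beyond the text above: this density is the one R provides as dchisq(y, df = 1), the chi-squared density with 1 degree of freedom. The sketch compares the derived formula with that built-in function and also compares a simulated probability with the corresponding integral of the derived pdf; the seed, sample size, and evaluation points are arbitrary.

```r
# pdf of Y = Z^2 as derived in Example 4.43
fY <- function(y) exp(-y / 2) / sqrt(2 * pi * y)

# Agreement with R's chi-squared density (1 degree of freedom)
y <- c(0.1, 0.5, 1, 2, 5)
max(abs(fY(y) - dchisq(y, df = 1)))

# Simulation check: P(Y <= 1) estimated from squared standard normals
set.seed(43)
z <- rnorm(100000)
p_sim <- mean(z^2 <= 1)

# The same probability from the derived pdf (integrable singularity at 0)
p_int <- integrate(fY, 0, 1)$value

c(simulated = p_sim, integrated = p_int, exact = 2 * pnorm(1) - 1)
```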


Example 4.44 Sometimes the Transformation Theorem cannot be used at all, and you need to use
the cdf. Let $f_X(x) = (x + 1)/8$, −1 ≤ x ≤ 3, and $Y = X^2$. The transformation is not monotonic on
[−1, 3]; and, since $f_X(x)$ is not an even function, we can’t employ the symmetry trick of the previous
example. Possible values of Y are {y: 0 ≤ y ≤ 9}. Considering first 0 ≤ y ≤ 1,

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{u + 1}{8}\,du = \frac{\sqrt{y}}{4}$$

Then, on the other subinterval, 1 < y ≤ 9,

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = P(-1 \le X \le \sqrt{y}) = \int_{-1}^{\sqrt{y}} \frac{u + 1}{8}\,du = \frac{1 + y + 2\sqrt{y}}{16}$$

Differentiating, we get

$$f_Y(y) = \begin{cases} \dfrac{1}{8\sqrt{y}} & 0 < y < 1 \\ \dfrac{\sqrt{y} + 1}{16\sqrt{y}} & 1 < y < 9 \\ 0 & \text{otherwise} \end{cases}$$

Figure 4.40 shows the pdfs of both X and Y.

Figure 4.40 Pdfs from Example 4.44: (a) pdf of X; (b) pdf of Y ■
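
A simulation offers an independent check on the piecewise result of Example 4.44. The sketch below is an illustration under stated assumptions: it generates X by inverting its cdf, $F_X(x) = (x + 1)^2/16$ on [−1, 3] (so $x = 4\sqrt{u} - 1$; this inversion idea is developed in Section 4.8), squares the values, and compares the empirical cdf of Y with the derived cdf, whose pieces are $\sqrt{y}/4$ on [0, 1] and $(1 + y + 2\sqrt{y})/16$ on (1, 9]. The seed, sample size, and checkpoints are arbitrary.

```r
set.seed(44)
u <- runif(100000)
x <- 4 * sqrt(u) - 1            # inverse of F_X(x) = (x + 1)^2 / 16 on [-1, 3]
y <- x^2                        # the transformed variable Y = X^2

# cdf of Y from Example 4.44 (two pieces, joined at y = 1)
FY <- function(y) ifelse(y <= 1, sqrt(y) / 4, (1 + y + 2 * sqrt(y)) / 16)

# Compare simulated and derived cdf values at a few checkpoints
pts <- c(0.25, 1, 4, 9)
sim <- sapply(pts, function(t) mean(y <= t))
rbind(simulated = sim, derived = FY(pts))
```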

Exercises: Section 4.7 (110-124) 

110. Relative to the winning time, the time X of another runner in a ten kilometer race has
pdf $f_X(x) = 2/x^3$, x > 1. The reciprocal Y = 1/X represents the ratio of the time for
the winner divided by the time of the other runner. Find the pdf of Y. Explain why
Y also represents the speed of the other runner relative to the winner.

111. Let X be the fuel efficiency in miles per gallon of an extremely inefficient vehicle (a
military tank, perhaps?), and suppose X has the pdf $f_X(x) = 2x$, 0 < x < 1. Determine the
pdf of Y = 1/X, which is fuel efficiency in gallons per mile. [Note: The distribution of
Y is a special case of the Pareto distribution (see Exercise 10).]

112. Let X have the pdf $f_X(x) = 2/x^3$, x > 1. Find the pdf of $Y = \sqrt{X}$.

113. Let X have an exponential distribution with mean 2, so $f_X(x) = \frac{1}{2}e^{-x/2}$, x > 0. Find the
pdf of $Y = \sqrt{X}$. [Note: Suppose you choose a point in two dimensions randomly, with
the horizontal and vertical coordinates chosen independently from the standard
normal distribution. Then X has the distribution of the squared distance from the
origin and Y has the distribution of the distance from the origin. Y has a Rayleigh
distribution (see Exercise 4).]

114. If X is distributed as N(μ, σ), find the pdf of $Y = e^X$. Verify that the distribution of
Y matches the lognormal pdf provided in Section 3.5.

115. If the length of a side of a square X is random with the pdf $f_X(x) = x/8$, 0 < x < 4,
and Y is the area of the square, find the pdf of Y.

116. Let X ~ Unif(0, 1). Determine the pdf of Y = −ln(X).

117. Let X ~ Unif(0, 1). Determine the pdf of Y = tan[π(X − .5)]. [Note: The random
variable Y has the Cauchy distribution, named after the famous mathematician.]

118. If X ~ Unif[0, 1], find a linear transformation Y = cX + d such that Y is uniformly
distributed on [A, B], where A and B are any two numbers such that A < B. Is there any
other solution? Explain.

119. If X has the pdf $f_X(x) = x/8$, 0 < x < 4, find a transformation Y = g(X) such that
Y ~ Unif[0, 1]. [Hint: The target is to achieve $f_Y(y) = 1$ for 0 ≤ y ≤ 1. The
Transformation Theorem will allow you to find h(y), from which g(x) can be obtained.]

120. a. If a measurement error X is uniformly distributed on [−1, 1], find the pdf of
Y = |X|, which is the magnitude of the measurement error.

b. If X ~ Unif[−1, 1], find the pdf of $Y = X^2$.

c. If X ~ Unif[−1, 3], find the pdf of $Y = X^2$.

121. If a measurement error X is distributed as N(0, 1), find the pdf of |X|, which is the
magnitude of the measurement error.

122. Ann is expected at 7:00 pm after an all-day drive. She may be as much as one hour early
or as much as three hours late. Assuming that her arrival time X is uniformly distributed
over that interval, find the pdf of |X − 7|, the absolute difference between her actual and
predicted arrival times.

123. A circular target has radius 1 foot. Assume that you hit the target (we shall ignore
misses) and that the probability of hitting any region of the target is proportional to
the region’s area. If you hit the target at a distance Y from the center, then let $X = \pi Y^2$
be the corresponding area. Show that

a. X is uniformly distributed on [0, π]. [Hint: Show that $F_X(x) = P(X \le x) = x/\pi$.]
b. Y has pdf $f_Y(y) = 2y$, 0 < y < 1.

124. In the previous exercise, suppose instead that Y is uniformly distributed on [0, 1].
Find the pdf of $X = \pi Y^2$. Geometrically speaking, why should X have a pdf that is
unbounded near 0?


4.8 Simulation of Continuous Random Variables 


In Sections 2.6 and 3.8, we discussed the need for simulation of random events and discrete random 
variables in situations where an “analytic” solution is very difficult or simply not possible. This 
section presents methods for simulating continuous random variables, including some of the built-in 
simulation tools of R. 


The Inverse CDF Method 
Section 3.8 introduced the inverse cdf method for simulating discrete random variables. The basic
idea was this: generate a Unif[0, 1) random number and align it with the cdf of the random variable
X we want to simulate. Then, determine which X value corresponds to that cdf value. We now extend
this methodology to the simulation of values from a continuous distribution; the heart of the algorithm
relies on the following theorem, often called the probability integral transform.


THEOREM  Consider a continuous distribution with pdf f and cdf F. Let U ~ Unif[0, 1),
and define a random variable X by

$$X = F^{-1}(U) \qquad (4.13)$$

Then the pdf of X is f.


Before proving this theorem, let’s consider its practical usage: Suppose we want to simulate a
continuous rv whose pdf is f(x), i.e., obtain successive values of X having pdf f(x). If we can determine
the corresponding cdf F(x) and apply its inverse $F^{-1}$ to values $u_1, \ldots, u_n$ obtained from a standard
uniform distribution, then $x_1 = F^{-1}(u_1), \ldots, x_n = F^{-1}(u_n)$ will be values from the desired
distribution f. A graphical description of the algorithm appears in Figure 4.41.

Figure 4.41 The inverse cdf method, illustrated


Proof Apply the Transformation Theorem (Section 4.7) with $f_U(u) = 1$ for 0 ≤ u < 1, $X = g(U) = F^{-1}(U)$,
and thus $U = h(X) = g^{-1}(X) = F(X)$. The pdf of the transformed variable X is

$$f_X(x) = f_U(h(x)) \cdot |h'(x)| = f_U(F(x)) \cdot |F'(x)| = 1 \cdot |f(x)| = f(x)$$

In the last step, the absolute values may be removed because a pdf is always nonnegative. ■


The following box describes the implementation of the inverse cdf method justified by the
preceding theorem.

INVERSE CDF METHOD  It is desired to simulate n values from a distribution with pdf f(x). Let F(x)
be the corresponding cdf. Repeat n times:

1. Use a random number generator (RNG) to produce a value, u, from [0, 1).
2. Assign $x = F^{-1}(u)$.

The resulting values $x_1, \ldots, x_n$ form a simulation of a random variable with
the original pdf, f(x).

Example 4.45 Consider the electrical current distribution model of Example 4.42, where the pdf of
X is given by f(x) = 1.25 − .25x for 2 ≤ x ≤ 4. Suppose a simulation of X is required as part of
some larger system analysis. To implement the above method, the inverse of the cdf of X is required.
First, compute the cdf:

$$F(x) = P(X \le x) = \int_2^x f(y)\,dy = \int_2^x (1.25 - .25y)\,dy = -0.125x^2 + 1.25x - 2 \qquad 2 \le x \le 4$$

To find the probability integral transform (4.13), set u = F(x) and solve for x:

$$u = F(x) = -0.125x^2 + 1.25x - 2 \quad \Rightarrow \quad x = F^{-1}(u) = 5 - \sqrt{9 - 8u}$$

The equation above can be solved using the quadratic formula; care must be taken to select the
solution whose values lie in the interval [2, 4] (the other solution, $x = 5 + \sqrt{9 - 8u}$, does not have
that feature). Beginning with the usual Unif[0, 1) RNG, the algorithm for simulating X is the
following: given a value u from the RNG, assign $x = 5 - \sqrt{9 - 8u}$. Repeating this algorithm n times
gives n simulated values of X. An R program that implements this algorithm appears in Figure 4.42; it
returns a vector, x, containing n = 10,000 simulated values of the specified distribution.

    x <- NULL
    for (i in 1:10000) {
      u <- runif(1)
      x[i] <- 5 - sqrt(9 - 8*u)
    }

Figure 4.42 R simulation code for Example 4.45


As discussed in Chapter 2, this program can be accelerated by “vectorizing” the operations rather
than using a for loop. In fact, a single line of code can produce the desired result:

    x <- 5 - sqrt(9 - 8*runif(10000))
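
One way to gain confidence in the transform used in Example 4.45 is to compare the simulated values with known features of the target distribution. The sketch below (seed chosen arbitrarily) checks that every simulated value falls in [2, 4] and compares the sample mean with the theoretical mean, which works out to $\int_2^4 x(1.25 - .25x)\,dx = 17/6 \approx 2.833$.

```r
set.seed(445)
x <- 5 - sqrt(9 - 8 * runif(10000))   # inverse cdf transform from Example 4.45

range(x)                               # all values should lie in [2, 4]
xbar <- mean(x)                        # simulated estimate of E(X)

# Theoretical mean from the pdf f(x) = 1.25 - .25x on [2, 4]
mu <- integrate(function(t) t * (1.25 - 0.25 * t), 2, 4)$value

c(simulated = xbar, theoretical = mu)
```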


The pdf of the rv X and a histogram of simulation results appear in Figure 4.43.

Figure 4.43 (a) Theoretical pdf and (b) R simulation results for Example 4.45 ■


Example 4.46 The lifetime of a certain type of drill bit has an exponential distribution with mean
100 h. An analysis of a large manufacturing process that uses these drill bits requires the simulation
of this lifetime distribution, which can be achieved through the inverse cdf method. From Section 4.4,
the cdf of this exponential distribution is $F(x) = 1 - e^{-x/100}$, and so the inverse cdf is
$x = F^{-1}(u) = -100\ln(1 - u)$. Applying this function to Unif[0, 1) random numbers will generate the desired
simulation. (Don’t let the negative sign at the front worry you: since 0 ≤ u < 1, 1 − u lies between 0
and 1, and so its logarithm is negative and the resulting value of x is actually positive.)

As a check, the code x=-100*log(1-runif(10000)) was submitted to R and the resulting
sample mean and sd were obtained using mean(x) and sd(x). Exponentially distributed rvs have
standard deviation equal to the mean, so the theoretical answers are μ = 100 and σ = 100. The
simulation yielded $\bar{x}$ = 99.3724 and s = 100.8908, both of which are reasonably close to 100 and
validate the inverse cdf formula.

In general, an exponential distribution with mean μ (equivalently, parameter λ = 1/μ) can be
simulated using the transform $x = -\mu\ln(1 - u)$. ■


The preceding two examples illustrated the inverse cdf method for fairly simple density functions:
a linear polynomial and an exponential function. In practice, the algebraic complexity of f(x) can often
be a barrier to implementing this simulation technique. After all, the algorithm requires that we can
(1) obtain the cdf F(x) in closed form and (2) find the inverse function of F in closed form. Consider,
for example, attempting to simulate values from the N(0, 1) distribution: its cdf is the function
denoted Φ(z) and given by the integral expression $(1/\sqrt{2\pi})\int_{-\infty}^{z} e^{-u^2/2}\,du$. There is no closed-form
expression for this integral, let alone a method to solve u = Φ(z) for z and implement (4.13). (As a
reminder, the lack of a closed-form expression for Φ(z) is the reason that software or tables are always
required for calculations involving normal probabilities.) Thankfully, most statistical software
packages have built-in tools to simulate normally distributed variates (one very clever algorithm
for this purpose is the Box-Muller method; see Section 5.6). We’ll discuss R’s built-in simulation
tools at the end of this section.

As the next example illustrates, even when F(x) can be determined in closed form we cannot
necessarily implement the inverse cdf method, because F(x) cannot always be inverted. This difficulty
surfaces in practice when attempting to simulate values from a gamma distribution.

Example 4.47 The measurement error X (in mV) of a particular voltmeter has the following
distribution: $f(x) = (4 - x^2)/9$ for −1 ≤ x ≤ 2 (and f(x) = 0 otherwise). To use the inverse cdf method
to simulate X, begin by calculating its cdf:

$$F(x) = \int_{-1}^{x} \frac{4 - y^2}{9}\,dy = \frac{12x - x^3 + 11}{27} \qquad -1 \le x \le 2$$

To implement step 2 of the inverse cdf method requires solving F(x) = u for x; since F(x) is a cubic
polynomial, this is not a simple task. Advanced computer algebra systems can solve this equation,
though the general solution is unwieldy (and such a solution doesn’t exist at all for 5th-degree and
higher polynomials). Readers familiar with numerical analysis methods may recognize that, for any
specified numerical value of u, a root-finding algorithm (such as Newton-Raphson) can be
implemented to approximate the solution x. This latter method, however, is computationally intensive,
especially if it’s desirable to generate 10,000 or more simulated values of x. ■
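
The numerical route mentioned at the end of Example 4.47 can be sketched in R. This illustration swaps in base R's uniroot (a bracketing root finder) in place of Newton-Raphson, since uniroot needs only the cdf and the known support [−1, 2]; the helper names Fcdf and Finv1, the seed, and the simulation size of 1000 are choices made for this example only. The theoretical mean used for comparison is $\int_{-1}^{2} x(4 - x^2)/9\,dx = 0.25$.

```r
# cdf from Example 4.47, valid for -1 <= x <= 2
Fcdf <- function(x) (12 * x - x^3 + 11) / 27

# Numerically solve F(x) = u for a single value u; F(-1) - u < 0 and
# F(2) - u > 0 whenever 0 < u < 1, so the root is bracketed by the support
Finv1 <- function(u) {
  uniroot(function(x) Fcdf(x) - u, lower = -1, upper = 2, tol = 1e-9)$root
}

set.seed(447)
u <- runif(1000)
x <- sapply(u, Finv1)       # one root-finding call per simulated value

range(x)                    # should stay inside [-1, 2]
mean(x)                     # compare with the theoretical mean 0.25
```

Each simulated value costs a full root search, which is the computational burden the example warns about.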


The preceding example suggests that in practice not every continuous distribution can be simulated
via the inverse cdf method. When the inverse cdf method of simulation cannot be implemented, the
accept-reject method provides an alternative. The downside of the accept-reject method is that only
some of the random numbers generated by software will be used (“accepted”), while others will be
“rejected.” As a result, one needs to create more (sometimes, many more) random variates than the
desired number of simulated values. For information on the accept-reject method, consult the texts by
Ross or Carlton and Devore listed in the bibliography.


Built-in Simulation Packages for R 

As was true for the most common discrete distributions, many software packages have built-in tools 
for simulating values from the continuous models named in this chapter. Table 4.4 summarizes the 
relevant R functions for the uniform, normal, gamma, and exponential distributions; the variable 
n refers to the desired number of simulated values of the distribution. R includes similar commands 
for the Weibull, lognormal, and beta distributions. 


Table 4.4 Functions to simulate major continuous distributions in R

Distribution        R code

Unif[A, B]          runif(n, A, B)
N(μ, σ)             rnorm(n, μ, σ)
Gamma(α, β)         rgamma(n, α, 1/β)
Exponential(λ)      rexp(n, λ)


As was the case with the cdf commands discussed in Section 4.4, R parameterizes the gamma and
exponential distributions using the “rate” parameter λ = 1/β. In the gamma simulation command, this
can be overridden by naming the final argument scale, as in rgamma(n, α, scale=β). The
command rnorm(n) will generate standard normal variates (i.e., with μ = 0 and σ = 1). Similarly,
R will generate standard uniform variates (A = 0 and B = 1), the basis for many of our simulation
methods, with the command runif(n).
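
The rate-versus-scale distinction is easy to get wrong, so a short demonstration may help. In the sketch below the parameter values (α = 3, β = 2, and an exponential mean of 100) are hypothetical, chosen only for illustration; the gamma mean αβ serves as the benchmark for the two gamma samples.

```r
set.seed(48)
n <- 100000
alpha <- 3; beta <- 2                    # hypothetical shape and scale values

g1 <- rgamma(n, alpha, 1 / beta)         # third positional argument is the rate 1/beta
g2 <- rgamma(n, alpha, scale = beta)     # same distribution, with scale named
c(mean(g1), mean(g2))                    # both should be near alpha * beta = 6

e <- rexp(n, 1 / 100)                    # exponential with mean 100, i.e., rate .01
mean(e)

z <- rnorm(n)                            # standard normal variates (mean 0, sd 1)
u <- runif(n)                            # standard uniform variates on [0, 1]
```

Writing rgamma(n, alpha, beta) by mistake would treat β as a rate and produce values with mean α/β instead of αβ.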
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Precision of Simulation Results 
Section 3.8 discusses in detail the precision of estimates associated with simulating discrete random
variables. The same results apply in the continuous case. In particular, the estimated standard error in
using a sample proportion $\hat{p}$ to estimate the true probability of an event is still $\sqrt{\hat{p}(1 - \hat{p})/n}$, where
n is the simulation size. Also, the estimated standard error in using a sample mean, $\bar{x}$, to estimate the
true expected value μ of a (continuous) rv X is $s/\sqrt{n}$, where s is the sample standard deviation of the
simulated values of X. Refer back to Section 3.8 for more details.
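
Both standard error formulas can be computed in a line or two of R. The following sketch reuses the exponential lifetime simulation of Example 4.46 (mean 100); the event X > 150 and the seed are arbitrary choices for illustration, and the exact probability $e^{-1.5}$ is shown only so that the estimate can be judged against it.

```r
set.seed(49)
n <- 10000
x <- -100 * log(1 - runif(n))      # exponential with mean 100, as in Example 4.46

# Estimate P(X > 150) and its estimated standard error
phat    <- mean(x > 150)
se_phat <- sqrt(phat * (1 - phat) / n)

# Estimate E(X) and its estimated standard error
xbar    <- mean(x)
se_xbar <- sd(x) / sqrt(n)

c(phat = phat, se_phat = se_phat, xbar = xbar, se_xbar = se_xbar)
exp(-1.5)                          # exact P(X > 150), for comparison
```

An estimate that lands within about two estimated standard errors of the exact value is behaving as the theory predicts.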


Exercises: Section 4.8 (125-130) 


125. The amount of time (hours) required to complete an unusually short statistics
homework assignment is modeled by the pdf f(x) = x/2 for 0 < x < 2 (and = 0 otherwise).

a. Obtain the cdf and then its inverse.

b. Write a program to simulate 10,000 values from this distribution.

c. Compare the sample mean and standard deviation of your 10,000 simulated values to the
theoretical mean and sd of this distribution (which you can determine by calculating the
appropriate integrals).

126. The Weibull distribution was introduced in Section 4.5.

a. Find the inverse of the Weibull cdf.

b. Write a program to simulate n values from a Weibull distribution. Your program should have
three inputs: the desired number of simulated values n and the two parameters α and β. It
should have a single output: an n × 1 vector of simulated values.

c. Use your program from part (b) to simulate 10,000 values from a Weibull(4, 6)
distribution and estimate the mean of this distribution. The correct value of the
mean is 6Γ(5/4) = 5.438; how close is your sample mean?

127. Consider the pdf for the rv X = magnitude (in newtons) of a dynamic load on a bridge,
given in Example 4.7:

$$f(x) = \frac{1}{8} + \frac{3}{8}x \qquad 0 \le x \le 2$$

Write a program to simulate values from this distribution using the inverse cdf method.

128. In distributed computing, any given task is split into smaller subtasks which are
handled by separate processors (which are then re-combined by a multiplexer). Consider a
distributed computing system with 4 processors, and suppose for one particular
purpose that the pdf of completion time for a particular subtask (microseconds) on any
one of the processors is given by $f(x) = 20/(3x^2)$ for 4 ≤ x ≤ 10 and = 0 otherwise.
That is, the subtask completion times $X_1, X_2, X_3, X_4$ of the four processors each
have the specified pdf.

a. Write a program to simulate the above pdf using the inverse cdf method.

b. The overall time to complete any task is the largest of the four subtask completion
times: if we call this variable Y, then $Y = \max(X_1, X_2, X_3, X_4)$. (We assume
that the multiplexing time is negligible.) Use your program in part (a) to simulate
10,000 values of the rv Y. Create a histogram of the simulated values of Y,
and also use your simulation to estimate both E(Y) and $\sigma_Y$.

129. Consider the following pdf:

$$f(x; \theta, \tau) = \frac{\theta}{\tau}\left(1 - x/\tau\right)^{\theta - 1} \qquad 0 \le x \le \tau$$

where θ > 0 and τ > 0 are the parameters of the model. [This pdf is suggested for
modeling waiting time in the article “A Model of Pedestrians’ Waiting Times for
Street Crossings at Signalized Intersections” (Trans. Res. 2013: 17-28).]

a. Write a function to simulate values from this distribution, implementing the
inverse cdf method. Your function should have three inputs: the desired
number of simulated values n and values for the two parameters θ and τ.

b. Use your function in part (a) to simulate 10,000 values from this wait time
distribution with θ = 4 and τ = 80. Estimate E(X) under these parameter
settings. How close is your estimate to the correct value of 16?

130. Explain why the transformation $x = -\mu\ln(u)$ may be used to simulate values from an
exponential distribution with mean μ. (This expression is slightly simpler than the one
established in this section.)
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An insurance company issues a policy 
covering losses up to 5 (in thousands of 
dollars). The loss, X, follows a distribution 
with density function $f(x) = 3/x^4$ for
x > 1 and f(x) = 0 otherwise. What is the
expected value of the amount paid under 
the policy? 

A 12-in. bar clamped at both ends is sub- 
jected to an increasing amount of stress 
until it snaps. Let Y = the distance from the 
left end at which the break occurs. Suppose 
Y has pdf

$$f(y) = \frac{y}{24}\left(1 - \frac{y}{12}\right) \qquad 0 \le y \le 12$$


Compute the following: 

a. The cdf of Y, and graph it. 

b. P(Y ≤ 4), P(Y > 6), and P(4 ≤ Y ≤ 6).

c. E(Y), E($Y^2$), and V(Y).

d. The probability that the break point 
occurs more than 2 in. from the expec- 
ted break point. 

e. The expected length of the shorter seg- 
ment when the break occurs. 


Let X denote the time to failure (in years) of 
a hydraulic component. Suppose the pdf of 
X is $f(x) = 32/(x + 4)^3$ for x > 0.

a. Verify that f(x) is a legitimate pdf. 

b. Determine the cdf. 

c. Use the result of part (b) to calculate the 
probability that time to failure is 
between 2 and 5 years. 

d. What is the expected time to failure? 

e. If the component has a salvage value 
equal to 100/(4 + x) when its time to 
failure is x, what is the expected salvage 
value? 


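The completion time X for a task has cdf
F(x) given by

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x^3}{3} & 0 \le x < 1 \\ 1 - \dfrac{1}{2}\left(\dfrac{7}{3} - x\right)\left(\dfrac{7}{4} - \dfrac{3}{4}x\right) & 1 \le x \le \dfrac{7}{3} \\ 1 & x > \dfrac{7}{3} \end{cases}$$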


a. Obtain the pdf f(x) and sketch its graph. 
b. Compute P(.5 ≤ X ≤ 2).

c. Compute E(X). 

The breakdown voltage of a randomly 
chosen diode of a certain type is known to 


be normally distributed with mean value 
40 V and standard deviation 1.5 V. 


a. What is the probability that the voltage 
of a single diode is between 39 and 42? 
b. What value is such that only 15% of all
diodes have voltages exceeding that 
value? 

c. If four diodes are independently selec- 
ted, what is the probability that at least 
one has a voltage exceeding 42? 


The article “Computer Assisted Net Weight 
Control” (Qual. Prog. 1983: 22-25) sug- 
gests a normal distribution with mean 
137.2 oz and standard deviation 1.6 oz, for 
the actual contents of jars of a certain type. 
The stated contents was 135 oz. 


a. What is the probability that a single jar 
contains more than the stated contents? 

b. Among ten randomly selected jars, what 
is the probability that at least eight contain 
more than the stated contents? 

c. Assuming that the mean remains at 
137.2, to what value would the standard 
deviation have to be changed so that 
95% of all jars contain more than the 
stated contents? 


When circuit boards used in the manufac- 
ture of MP3 players are tested, the long-run 
percentage of defectives is 5%. Suppose 
that a batch of 250 boards has been 
received and that the condition of any 
particular board is independent of that of 
any other board. 


a. What is the approximate probability that 
at least 10% of the boards in the batch 
are defective? 

b. What is the approximate probability that 
there are exactly ten defectives in the 
batch? 


Let X be a nonnegative continuous random 
variable with pdf f(x), cdf F(x), and mean 
E(X). 


a. The definition of expected value is
$E(X) = \int_0^\infty x f(x)\,dx$. Replace the first
x inside the integral with $\int_0^x 1\,dy$ to
create a double integral expression for
E(X). [The “order of integration”
should be dy dx.]

b. Re-arrange the order of integration,
keeping track of the revised limits of
integration, to show that

$$E(X) = \int_0^\infty \int_y^\infty f(x)\,dx\,dy$$

c. Evaluate the dx integral in (b) to show
that $E(X) = \int_0^\infty [1 - F(y)]\,dy$. (This
provides an alternate derivation of the
formula established in Exercise 38.)

d. Use the result of (c) to verify that the
expected value of an exponentially distributed
rv with parameter λ is 1/λ.


The reaction time (in seconds) to a stimulus
is a continuous random variable with pdf
f(x) = 3/(2x²) for 1 ≤ x ≤ 3 and = 0
otherwise.

a. Obtain the cdf. 

b. Using the cdf, what is the probability 
that reaction time is at most 2.5 s? 
Between 1.5 and 2.5 s? 

c. Compute the expected reaction time. 

d. Compute the standard deviation of 
reaction time. 

e. If an individual takes more than 1.5 s to 
react, a light comes on and stays on 
either until one further second has 
elapsed or until the person reacts 
(whichever happens first). Determine 
the expected amount of time that the 
light remains lit. [Hint: Let h(X) = the 
time that the light is on as a function of 
reaction time X.] 


Let X denote the temperature at which a
certain chemical reaction takes place. Suppose
that X has pdf f(x) = (4 − x²)/9 for
−1 ≤ x ≤ 2 and = 0 otherwise.

a. Sketch the graph of f(x). 

b. Determine the cdf and sketch it. 

c. Is 0 the median temperature at which the 
reaction takes place? If not, is the median 
temperature smaller or larger than 0? 

d. Suppose this reaction is independently 
carried out once in each of ten different 
laboratories and that the pdf of reaction
temperature in each laboratory is as given. Let
Y = the number among the ten labora- 
tories at which the temperature exceeds 
1. What kind of distribution does 
Y have? (Give the name and values of 
any parameters.) 


The article “Determination of the MTF of 
Positive Photoresists Using the Monte 
Carlo Method” (Photographic Sci. Engr. 
1983: 254-260) proposes the exponential 
distribution with parameter λ = .93 as a
model for the distribution of a photon’s free 
path length (um) under certain circum- 
stances. Suppose this is the correct model. 


a. What is the expected path length, and 
what is the standard deviation of path 
length? 

b. What is the probability that path length 
exceeds 3.0? What is the probability 
that path length is between 1.0 and 3.0? 

c. What value is exceeded by only 10% of 
all path lengths? 


The article “The Prediction of Corrosion by 
Statistical Analysis of Corrosion Profiles” 
(Corrosion Sci. 1985: 305-315) suggests 
the following cdf for the depth X of the 
deepest pit in an experiment involving the 
exposure of carbon manganese steel to 
acidified seawater. 


$$F(x; \theta_1, \theta_2) = e^{-e^{-(x - \theta_1)/\theta_2}}, \quad -\infty < x < \infty$$

(This is called the Gumbel distribution.)
The investigators proposed the values
θ₁ = 150 and θ₂ = 90. Assume this to be
the correct model.


a. What is the probability that the depth of 
the deepest pit is at most 150? At most 
300? Between 150 and 300? 

b. Below what value will the depth of the 
maximum pit be observed in 90% of all 
such experiments? 

c. What is the density function of X? 
d. The density function can be shown to be
unimodal (a single peak). Above what 
value on the measurement axis does this 
peak occur? (This value is the mode.) 

e. It can be shown that E(X) ≈ .5772θ₂ + θ₁.
What is the mean for the given values of θ₁
and θ₂, and how does it compare to the
median and mode? Sketch the graph of the
density function.


Let t = the amount of sales tax a retailer
owes the government for a certain period.
The article “Statistical Sampling in Tax
Audits” (Statistics Law 2008: 320-343)
proposes modeling the uncertainty in t by
regarding it as a normally distributed random
variable with mean value μ and standard
deviation σ (in the article, these two
parameters are estimated from the results of
a tax audit involving n sampled transactions).
If a represents the amount the retailer
is assessed, then an underassessment results
if t > a and an overassessment if a > t. We
can express this in terms of a loss function, a
function that shows zero loss if t = a but
increases as the gap between t and a increases.
The proposed loss function is
L(a, t) = t − a if t > a and = k(a − t) if
t ≤ a (k > 1 is suggested to incorporate the
idea that overassessment is more serious
than underassessment).

a. Show that a* = μ + σΦ⁻¹(1/(k + 1)) is
the value of a that minimizes the
expected loss, where Φ⁻¹ is the inverse
function of the standard normal cdf.

b. If k = 2 (suggested in the article),
μ = $100,000, and σ = $10,000, what is
the optimal value of a, and what is
the resulting probability of overassessment?


A mode of a continuous distribution is a 
value x* that maximizes f(x). 


a. What is the mode of a normal distribution
with parameters μ and σ?

b. Does the uniform distribution with
parameters A and B have a single mode?
Why or why not?

c. What is the mode of an exponential
distribution with parameter λ? (Draw a
picture.)

d. If X has a gamma distribution with
parameters α and β, and α > 1, find the
mode. [Hint: ln[f(x)] will be maximized
if and only if f(x) is, and it may be simpler
to take the derivative of ln[f(x)].]

The article “Error Distribution in Naviga- 
tion” (J. Institut. Navigation 1971: 429- 
442) suggests that the frequency distribu- 
tion of positive errors (magnitudes of 
errors) is well approximated by an expo- 
nential distribution. Let X = the lateral 
position error (nautical miles), which can 
be either negative or positive. Suppose the 
pdf of X is 

$$f(x) = (.1)e^{-.2|x|}, \quad -\infty < x < \infty$$

a. Sketch a graph of f(x) and verify that it 
is a legitimate pdf (show that it inte- 
grates to 1). 

b. Obtain the cdf of X and sketch it. 

c. Compute P(X < 0), P(X < 2), 
P(-1 < X < 2), and the probability 
that an error of more than 2 miles is 
made. 


In some systems, a customer is allocated to
one of two service facilities. If the service
time for a customer served by facility i has
an exponential distribution with parameter
λᵢ (i = 1, 2) and p is the proportion of all
customers served by facility 1, then the pdf
of X = the service time of a randomly
selected customer is

$$f(x; \lambda_1, \lambda_2, p) = p\lambda_1 e^{-\lambda_1 x} + (1 - p)\lambda_2 e^{-\lambda_2 x}, \quad x \ge 0$$

This is often called the hyperexponential
or mixed exponential distribution. This
distribution is also proposed in the article
“Statistical Behavior Modeling for Driver-
Adaptive Precrash Systems” (IEEE Trans.
Intelligent Transp. Syst. 2013: 1-9) as a
model for what the authors call
“the criticality level of a situation.”


a. Verify that f(x; λ₁, λ₂, p) is indeed a pdf.

b. If p = .5, λ₁ = 40, λ₂ = 200 (λ values
suggested in the cited article), calculate
P(X > .01).

c. If X has f(x; λ₁, λ₂, p) as its pdf, what is
E(X)?

d. Using the fact that E(X²) = 2/λ² when
X has an exponential distribution with
parameter λ, compute E(X²) when X has
pdf f(x; λ₁, λ₂, p). Then compute V(X).

e. The coefficient of variation of a random
variable (or distribution) is CV = σ/μ.
What is the CV for an exponential rv?
What can you say about the value of CV
when X has a hyperexponential
distribution?

f. What is the CV for an Erlang distribution
with parameters λ and n as defined
in Exercise 78? [Note: In applied work,
the sample CV is used to decide which
of the three distributions might be
appropriate.]

g. For the parameter values given in (b),
calculate the probability that X is within
one standard deviation of its mean
value. Does this probability depend
upon the values of the λ’s (it does not
depend on λ when X has an exponential
distribution)?


Suppose a state allows individuals filing tax 
returns to itemize deductions only if the 
total of all itemized deductions is at least 
$5000. Let X (in 1000’s of dollars) be the 
total of itemized deductions on a randomly 
chosen form. Assume that X has the pdf 


$$f(x; \alpha) = \frac{k}{x^{\alpha}}, \quad x \ge 5$$

a. Find the value of k. What restriction on
α is necessary?

b. What is the cdf of X? 

c. What is the expected total deduction on
a randomly chosen form? What restriction
on α is necessary for E(X) to be
finite?

d. Show that ln(X/5) has an exponential
distribution with parameter α − 1.


Let Iᵢ be the input current to a transistor and
Iₒ be the output current. Then the current
gain is proportional to ln(Iₒ/Iᵢ). Suppose the
constant of proportionality is 1 (which
amounts to choosing a particular unit of
measurement), so that current gain = X =
ln(Iₒ/Iᵢ). Assume X is normally distributed
with μ = 1 and σ = .05.

a. What type of distribution does the ratio
Iₒ/Iᵢ have?

b. What is the probability that the output 
current is more than twice the input 
current? 

c. What are the expected value and vari- 
ance of the ratio of output to input 
current? 


The article “Response of SiC/Si₃N₄ Composites
Under Static and Cyclic Loading—
An Experimental and Statistical Analysis”
(J. Engr. Mater. Tech. 1997: 186-193)
suggests that tensile strength (MPa) of
composites under specified conditions can
be modeled by a Weibull distribution with
α = 9 and β = 180.

a. Sketch a graph of the density function. 

b. What is the probability that the strength 
of a randomly selected specimen will 
exceed 175? Will be between 150 and 
175? 

c. If two randomly selected specimens are 
chosen and their strengths are indepen- 
dent of each other, what is the proba- 
bility that at least one has strength 
between 150 and 175? 

d. What strength value separates the 
weakest 10% of all specimens from the 
remaining 90%? 
150. Suppose the lifetime X of a component,
when measured in hours, has a gamma
distribution with parameters α and β.
a. Let Y = lifetime measured in minutes.
Derive the pdf of Y.
b. What is the probability distribution of
Y = cX?

Based on data from a dart-throwing exper- 
iment, the article “Shooting Darts” 
(Chance, Summer 1997: 16-19) proposed 
that the horizontal and vertical errors from 
aiming at a point target should be inde- 
pendent of each other, each with a normal 
distribution having mean 0 and variance σ².
It can then be shown that the pdf of the
distance V from the target to the landing
point is

$$f(v) = \frac{v}{\sigma^2}\, e^{-v^2/(2\sigma^2)}, \quad v > 0$$

a. This pdf is a member of what family 
introduced in this chapter? 

b. If σ = 20 mm (close to the value suggested
in the paper), what is the probability
that a dart will land within 25 mm
(roughly 1 in.) of the target?


The article “Three Sisters Give Birth on the 
Same Day” (Chance, Spring 2001: 23-25) 
used the fact that three Utah sisters had all 
given birth on March 11, 1998, as a basis 
for posing some interesting questions 
regarding birth coincidences. 


a. Disregarding leap year and assuming 
that the other 365 days are equally 
likely, what is the probability that three 
randomly selected births all occur on 
March 11? Be sure to indicate what, if 
any, extra assumptions you are making. 

b. With the assumptions used in part (a), 
what is the probability that three ran- 
domly selected births all occur on the 
same day? 

c. The author suggested that, based on
extensive data, the length of gestation
(time between conception and birth)
could be modeled as having a normal
distribution with mean value 280 days
and standard deviation 19.88 days. The
due dates for the three Utah sisters were 
March 15, April 1, and April 4, respec- 
tively. Assuming that all three due dates 
are at the mean of the distribution, what 
is the probability that all births occurred 
on March 11? [Hint: The deviation of 
birth date from due date is normally 
distributed with mean 0.] 

d. Explain how you would use the infor- 
mation in part (c) to calculate the 
probability of a common birth date. 


153. Let X denote the lifetime of a component, 


with f(x) and F(x) the pdf and cdf of X. The
probability that the component fails in the
interval (x, x + Δx) is approximately
f(x) · Δx. The conditional probability that it
fails in (x, x + Δx) given that it has lasted at
least x is f(x) · Δx/[1 − F(x)]. Dividing this
by Δx produces the failure rate function:

$$r(x) = \frac{f(x)}{1 - F(x)}$$

An increasing failure rate function indicates
that older components are increasingly
likely to wear out, whereas a decreasing
failure rate is evidence of increasing reliability
with age. In practice, a “bathtub-shaped”
failure rate is often assumed.


a. If X is exponentially distributed, what is 
r(x)? 

b. If X has a Weibull distribution with
parameters α and β, what is r(x)? For
what parameter values will r(x) be
increasing? For what parameter values
will r(x) decrease with x?

c. Since r(x) = −(d/dx) ln[1 − F(x)],
ln[1 − F(x)] = −∫ r(x) dx. Suppose

$$r(x) = \alpha\left(1 - \frac{x}{\beta}\right), \quad 0 \le x \le \beta$$

so that if a component lasts β hours, it will
last forever (while seemingly unreasonable,
this model can be used to study just “initial
wearout”). What are the cdf and pdf of X?

Let X have a Weibull distribution with
shape parameter α and scale parameter β.
Show that the transformed variable
Y = ln(X) has an extreme value distribution
as defined in Section 4.6, with θ₁ = ln(β)
and θ₂ = 1/α.


Let X have a Weibull distribution with
parameters α = 2 and β. Show that
Y = 2X²/β² has a gamma distribution, and
identify its parameters.

Let X have the pdf f(x) = 1/[π(1 + x²)] for
−∞ < x < ∞ (a central Cauchy distribution),
and show that Y = 1/X has the same
distribution. [Hint: Consider P(|Y| ≤ y),
the cdf of |Y|, then obtain its pdf and show it
is identical to the pdf of |X|.]


A store will order q gallons of a liquid 
product to meet demand during a particular 
time period. This product can be dispensed 
to customers in any amount desired, so 
demand during the period is a continuous 
random variable X with cdf F(x). There is a 
fixed cost c₀ for ordering the product plus a
cost of c₁ per gallon purchased. The per
gallon sale price of the product is d. Liquid
left unsold at the end of the time period has
a salvage value of e per gallon. Finally, if
demand exceeds q, there will be a shortage
cost for loss of goodwill and future business;
this cost is f per gallon of unfulfilled
demand. Show that the value of q that
maximizes expected profit, denoted by q*,
satisfies

$$P(\text{satisfying demand}) = F(q^*) = \frac{d - c_1 + f}{d - e + f}$$

Then determine the value of F(q*) if
d = $35, c₀ = $25, c₁ = $15, e = $5, and
f = $25. [Hint: Let x denote a particular
value of X. Develop an expression for profit
when x ≤ q and another expression for
profit when x > q. Now write an integral
expression for expected profit (as a function
of q) and differentiate.]

158. A function g(x) is convex if the chord
connecting any two points on the function’s
graph lies above the graph. When g(x) is
differentiable, an equivalent condition is
that for every x, the tangent line at x lies
entirely on or below the graph. (See the
accompanying figure.) How does g(μ) =
g[E(X)] compare to E[g(X)]? [Hint: The
equation of the tangent line at x = μ is
y = g(μ) + g′(μ) · (x − μ). Use the condition
of convexity, substitute X for x, and
take expected values.] Note: Unless g(x) is
linear, the resulting inequality, usually
called Jensen’s inequality, is strict (< rather
than ≤); it is valid for both continuous and
discrete rvs.

Introduction

In Chapters 3 and 4, we developed probability models for a single random variable. Many problems 
in probability and statistics lead to models involving several random variables simultaneously. In this 
chapter, we first discuss probability models for the joint behavior of several random variables, putting 
special emphasis on the case in which the variables are independent of each other. We then study 
expected values of functions of several random variables, including covariance and correlation as 
measures of the degree of association between two variables. 

Section 5.3 develops properties of linear combinations of random variables, with particular 
emphasis on the sum and the average. The next section considers conditional distributions, the 
distributions of random variables given the values of other random variables. In Section 5.5 we 
extend the normal distribution of Chapter 4 to two possibly dependent rvs. The next section is about 
transformations of two or more random variables, generalizing the results of Section 4.7. In the last 
section of this chapter we discuss the distribution of order statistics: the minimum, maximum, median, 
and other quantities that can be found by arranging the observations in order. 


5.1 Jointly Distributed Random Variables 


There are many experimental situations in which more than one random variable (rv) will be of 
interest to an investigator. For example X might be the number of books checked out from a public 
library on a particular day and Y the number of videos checked out on the same day. Or X and Y might 
be the height and weight, respectively, of a randomly selected adult. In general, the two rvs of interest 
could both be discrete, both be continuous, or one could be discrete and the other continuous. In 
practice, the two “pure” cases—both of the same type—predominate. We shall first consider joint 
probability distributions for two discrete rvs, then for two continuous variables, and finally for more 
than two variables. 


The Joint Probability Mass Function for Two Discrete Random Variables 

The probability mass function (pmf) of a single discrete rv X specifies how much probability mass is 
placed on each possible X value. The joint pmf of two discrete rvs X and Y describes how much 
probability mass is placed on each possible pair of values (x, y). 
DEFINITION Let X and Y be two discrete rvs defined on the sample space 𝒮 of an experiment. The
joint probability mass function p(x, y) is defined for each pair of numbers (x, y) by

$$p(x, y) = P(X = x \text{ and } Y = y)$$

A function p(x, y) can be used as a joint pmf provided that p(x, y) ≥ 0 for all x and y and
$\sum_x \sum_y p(x, y) = 1$. Let A be any set consisting of pairs of (x, y) values, such as {(x, y): x + y < 10}.
Then the probability that the random pair (X, Y) lies in A is obtained by summing the joint pmf over
pairs in A:

$$P((X, Y) \in A) = \mathop{\sum\sum}_{(x, y) \in A} p(x, y)$$

As in previous chapters, we will display a joint pmf for the values in its support (i.e., the set of all
(x, y) values for which p(x, y) > 0), with the understanding that p(x, y) = 0 otherwise.


Example 5.1 A large insurance agency services a number of customers who have purchased both a 
homeowner’s policy and an automobile policy from the agency. For each type of policy, a deductible 
amount must be specified. For an automobile policy, the choices are $100 and $250, whereas for a 
homeowner’s policy, the choices are 0, $100, and $200. Suppose an individual with both types of 
policy is selected at random from the agency’s files. Let X = the deductible amount on the auto policy 
and Y = the deductible amount on the homeowner’s policy. Possible (X, Y) pairs are then (100, 0), 
(100, 100), (100, 200), (250, 0), (250, 100), and (250, 200); the joint pmf specifies the probability 
associated with each one of these pairs, with any other pair having probability zero. Suppose the joint 
pmf is given in the accompanying joint probability table: 


                      y
p(x, y)        0      100     200

x    100      .20     .10     .20
     250      .05     .15     .30


Then p(100, 100) = P(X = 100 and Y = 100) = P($100 deductible on both policies) = .10. The
probability P(Y ≥ 100) is computed by summing probabilities of all (x, y) pairs for which y ≥ 100:


P(Y ≥ 100) = p(100, 100) + p(250, 100) + p(100, 200) + p(250, 200) = .75 ■


Looking at the joint probability table in Example 5.1, we see that P(X = 100), i.e., p_X(100), equals
.20 + .10 + .20 = .50, and similarly p_X(250) = .05 + .15 + .30 = .50 as well. That is, the pmf of X at
a specified number is calculated by fixing an x value (say, 100 or 250) and summing across all
possible y values; e.g., p_X(250) = p(250, 0) + p(250, 100) + p(250, 200). The pmf of Y can be obtained
by analogous summation, adding “down” the table instead of “across.” In fact, by adding across rows
and down columns, we could imagine writing these probabilities in the margins of the joint probability
table; for this reason, p_X and p_Y are called the marginal distributions of X and Y.
DEFINITION The marginal probability mass functions of X and of Y, denoted by p_X(x)
and p_Y(y), respectively, are given by


$$p_X(x) = \sum_y p(x, y) \qquad p_Y(y) = \sum_x p(x, y)$$


Thus to obtain the marginal pmf of X evaluated at, say, x = 100, the probabilities p(100, y) are added 
over all possible y values. Doing this for each possible X value gives the marginal pmf of X alone (i.e., 
without reference to Y). From the marginal pmfs, probabilities of events involving only X or only 
Y can be computed. 


Example 5.2 (Example 5.1 continued) The possible X values are x = 100 and x = 250, so com- 
puting row totals in the joint probability table yields 


p_X(100) = p(100, 0) + p(100, 100) + p(100, 200) = .50
and
p_X(250) = p(250, 0) + p(250, 100) + p(250, 200) = .50
The marginal pmf of X is then
p_X(x) = .50    x = 100, 250


Similarly, the marginal pmf of Y is obtained from column totals as


$$p_Y(y) = \begin{cases} .25 & y = 0, 100 \\ .50 & y = 200 \end{cases}$$

so P(Y ≥ 100) = p_Y(100) + p_Y(200) = .75 as before. ■
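
The row-and-column bookkeeping of Examples 5.1 and 5.2 is easy to mirror in software. The following is only a minimal sketch in Python, not a prescribed implementation: the dictionary layout and the helper names `marginal` and `prob` are our own illustrative choices, and the probabilities are those of the joint probability table in Example 5.1.

```python
# Joint pmf of Example 5.1, keyed by (x, y) =
# (auto deductible, homeowner's deductible).
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def marginal(pmf, axis):
    """Sum the joint pmf over the other variable.

    axis=0 returns the marginal pmf of X, axis=1 the marginal pmf of Y.
    (Hypothetical helper, introduced only for this illustration.)
    """
    out = {}
    for pair, p in pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def prob(pmf, event):
    """P((X, Y) in A), with the set A described by a predicate event(x, y)."""
    return sum(p for (x, y), p in pmf.items() if event(x, y))


p_X = marginal(joint_pmf, 0)   # row totals
p_Y = marginal(joint_pmf, 1)   # column totals
p_y_at_least_100 = prob(joint_pmf, lambda x, y: y >= 100)

print("p_X:", p_X)
print("p_Y:", p_Y)
print("P(Y >= 100):", round(p_y_at_least_100, 4))
```

Describing the set A by a predicate keeps the code close to the definition: any event involving (X, Y) is handled by summing the joint pmf over the pairs that satisfy it.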


The Joint Probability Density Function for Two Continuous Random Variables 

The probability that the observed value of a continuous rv X lies in a one-dimensional set A (such as 
an interval) is obtained by integrating the pdf f(x) over the set A. Similarly, the probability that the pair 
(X, Y) of continuous rvs falls in a two-dimensional set A (such as a rectangle) is obtained by 
integrating a function called the joint density function. 


DEFINITION Let X and Y be continuous rvs. Then f(x, y) is the joint probability density function 
for X and Y if for any two-dimensional set A 


$$P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$$


In particular, if A is the two-dimensional rectangle {(x, y): a ≤ x ≤ b, c ≤ y ≤ d},
then

$$P((X, Y) \in A) = P(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx$$


For f(x, y) to be a joint pdf, it must satisfy f(x, y) ≥ 0 and $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$. We can think
of f(x, y) as specifying a surface at height f(x, y) above the point (x, y) in a three-dimensional 
coordinate system. Then P((X, Y) € A) is the volume underneath this surface and above the region 
A, analogous to the area under a curve in the one-dimensional case. This is illustrated in Figure 5.1. 


Figure 5.1 P((X, Y) ∈ A) = volume under density surface above A


Example 5.3 A bank operates both a drive-up facility and a walk-up window. On a randomly 
selected day, let X = the proportion of time that the drive-up facility is in use (at least one customer is 
being served or waiting to be served) and Y = the proportion of time that the walk-up window is in 
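use. Then the set of possible values for (X, Y) is the rectangle D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.
Suppose the joint pdf of (X, Y) is given by

$$f(x, y) = \begin{cases} \dfrac{6}{5}(x + y^2) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

To verify that this is a legitimate pdf, note that f(x, y) ≥ 0 and

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_0^1\int_0^1 \frac{6}{5}(x + y^2)\,dx\,dy = \int_0^1 \frac{6}{5}x\,dx + \int_0^1 \frac{6}{5}y^2\,dy = \frac{6}{10} + \frac{6}{15} = 1$$
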
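The probability that neither facility is busy more than one-quarter of the time is

$$\begin{aligned}
P\left(0 \le X \le \tfrac{1}{4},\ 0 \le Y \le \tfrac{1}{4}\right) &= \int_0^{1/4}\int_0^{1/4} \frac{6}{5}(x + y^2)\,dx\,dy \\
&= \frac{6}{5}\int_0^{1/4}\int_0^{1/4} x\,dx\,dy + \frac{6}{5}\int_0^{1/4}\int_0^{1/4} y^2\,dx\,dy \\
&= \frac{6}{20}\cdot\frac{x^2}{2}\,\Big|_{x=0}^{x=1/4} + \frac{6}{20}\cdot\frac{y^3}{3}\,\Big|_{y=0}^{y=1/4} = \frac{7}{640} \\
&= .0109
\end{aligned}$$
■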

The marginal pmf of one discrete variable results from summing the joint pmf over all values of the 
other variable. Similarly, the marginal pdf of one continuous variable is obtained by integrating the 
joint pdf over all values of the other variable. 


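DEFINITION The marginal probability density functions of X and Y, denoted by f_X(x) and f_Y(y),
respectively, are given by


$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad \text{for } -\infty < x < \infty$$

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \quad \text{for } -\infty < y < \infty$$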

Example 5.4 (Example 5.3 continued) The marginal pdf of X, which gives the probability distri- 
bution of busy time for the drive-up facility without reference to the walk-up window, is 


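$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_0^1 \frac{6}{5}(x + y^2)\,dy = \frac{6}{5}x + \frac{2}{5}$$
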

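for 0 ≤ x ≤ 1 and 0 otherwise. Similarly, the marginal pdf of Y is

$$f_Y(y) = \begin{cases} \dfrac{6}{5}y^2 + \dfrac{3}{5} & 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Then, for example,

$$P\left(\tfrac{1}{4} \le Y \le \tfrac{3}{4}\right) = \int_{1/4}^{3/4} f_Y(y)\,dy = \frac{37}{80} = .4625$$
■
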
In Examples 5.3-5.4, the region of positive joint density was a rectangle, which made computation of
the marginal pdfs relatively easy. Consider now an example in which the region of positive density is 
a more complicated figure. 


Example 5.5 A nut company markets cans of deluxe mixed nuts containing almonds, cashews, and 
peanuts. Suppose the net weight of each can is exactly 1 lb, but the weight contribution of each type 
of nut is random. Because the three weights sum to 1, a joint probability model for any two gives all 
necessary information about the weight of the third type. Let X = the weight of almonds in a selected 
can and Y = the weight of cashews. Then the region of positive density is D = {(x, y): 0 ≤ x ≤ 1,
0 ≤ y ≤ 1, x + y ≤ 1}, the shaded region pictured in Figure 5.2.


Figure 5.2 Region of positive density for Example 5.5


Now let the joint pdf for (X, Y) be

$$f(x, y) = \begin{cases} 24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\ 0 & \text{otherwise} \end{cases}$$


For any fixed x, f(x, y) increases with y; for fixed y, f(x, y) increases with x. This is appropriate because 
the word deluxe implies that most of the can should consist of almonds and cashews rather than 
peanuts, so that the density function should be large near the upper boundary and small near the 
origin. The surface determined by f(x, y) slopes upward from zero as (x, y) moves away from either 
axis. 

Clearly, f(x, y) ≥ 0. To verify the second condition on a joint pdf, recall that a double integral is
computed as an iterated integral by holding one variable fixed (such as x as in Figure 5.2), integrating
over values of the other variable lying along the straight line passing through the value of the fixed
variable, and finally integrating over all possible values of the fixed variable. Thus

$$\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx &= \iint_D f(x, y)\,dy\,dx = \int_0^1 \left\{\int_0^{1-x} 24xy\,dy\right\} dx \\
&= \int_0^1 24x \left\{\frac{y^2}{2}\,\Big|_{y=0}^{y=1-x}\right\} dx = \int_0^1 12x(1 - x)^2\,dx = 1
\end{aligned}$$

To compute the probability that the two types of nuts together make up at most 50% of the can, let
A = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and x + y ≤ .5}, as shown in Figure 5.3. Then

$$P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy = \int_0^{.5}\int_0^{.5 - x} 24xy\,dy\,dx = .0625$$


Figure 5.3 Computing P((X, Y) ∈ A) for Example 5.5


The marginal pdf for almonds is obtained by holding X fixed at x and integrating f(x, y) along the 
vertical line through x: 


$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_0^{1-x} 24xy\,dy = 12x(1 - x)^2, \quad 0 \le x \le 1$$


By symmetry of f(x, y) and the region D, the marginal pdf of Y is obtained by replacing x and X in
f_X(x) by y and Y, respectively: f_Y(y) = 12y(1 − y)² for 0 ≤ y ≤ 1. ■
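
When the region of positive density is not a rectangle, the limits of the inner integral depend on the outer variable, and this carries over directly to numerical work. The following sketch, assuming the joint pdf 24xy on the triangle of Example 5.5, nests a one-dimensional midpoint rule inside another. The helper names `integrate` and `marginal_x` and the number of subintervals are illustrative choices of ours, not part of the example.

```python
def integrate(g, lo, hi, n=400):
    """One-dimensional midpoint rule with n subintervals (illustrative helper)."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))


def joint_pdf(x, y):
    """Joint pdf of Example 5.5: 24xy on the triangle x, y >= 0, x + y <= 1."""
    if x >= 0 and y >= 0 and x + y <= 1:
        return 24 * x * y
    return 0.0


def marginal_x(x):
    """f_X(x): integrate the joint pdf along the vertical line through x.

    For fixed x the density is positive only for 0 <= y <= 1 - x.
    """
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0 - x)


# Total probability: integrate the marginal of X over 0 <= x <= 1.
total_mass = integrate(marginal_x, 0.0, 1.0)

# P(X + Y <= .5): for fixed x in [0, .5], y runs from 0 to .5 - x.
p_at_most_half = integrate(
    lambda x: integrate(lambda y: joint_pdf(x, y), 0.0, 0.5 - x),
    0.0,
    0.5,
)

print("total mass:", total_mass)
print("f_X(.25):", marginal_x(0.25), "vs 12x(1-x)^2 =", 12 * 0.25 * 0.75 ** 2)
print("P(X + Y <= .5):", p_at_most_half)
```

The outer call plays the role of "integrating over all possible values of the fixed variable," while the inner call integrates along the line through that fixed value.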


Independent Random Variables 

In many situations, information about the observed value of one of the two variables X and Y gives 
information about the value of the other variable. In Example 5.1, the marginal probability of X at 
x = 250 was .5, as was the probability that X = 100. If, however, we are told that the selected 
individual had Y = 0, then X = 100 is four times as likely as X = 250. Thus there is a dependence 
between the two variables. 


In Chapter 2 we pointed out that one way of defining independence of two events is to say that 
A and B are independent if P(A ∩ B) = P(A) · P(B). Here is an analogous definition for the
independence of two rvs.
DEFINITION Two random variables X and Y are said to be independent if for every pair of
x and y values,


p(x, y) = p_X(x) · p_Y(y)    when X and Y are discrete
or                                                          (5.1)
f(x, y) = f_X(x) · f_Y(y)    when X and Y are continuous


If (5.1) is not satisfied for all (x, y), then X and Y are said to be dependent.


The definition says that two variables are independent if their joint pmf or pdf is the product of the 
two marginal pmfs or pdfs. 


Example 5.6 In the insurance situation of Examples 5.1 and 5.2,

p(100, 100) = .10 ≠ (.5)(.25) = p_X(100) · p_Y(100)


so X and Y are not independent. Independence of X and Y requires that every entry in the joint
probability table be the product of the corresponding row and column marginal probabilities. ■
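
The requirement that every entry of the joint probability table equal the product of its row and column totals can be checked mechanically. The sketch below is one way to do so in Python; the function name `is_independent`, the tolerance, and the second (product) table are illustrative assumptions of ours, while the first table is the joint pmf of Example 5.1.

```python
from itertools import product

# Joint pmf of Example 5.1, keyed by (x, y).
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def is_independent(pmf, tol=1e-12):
    """Return True if p(x, y) = p_X(x) * p_Y(y) for every (x, y) pair.

    Pairs missing from the dictionary are treated as having probability 0.
    """
    xs = sorted({x for x, _ in pmf})
    ys = sorted({y for _, y in pmf})
    p_x = {x: sum(pmf.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(pmf.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(pmf.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x, y in product(xs, ys)
    )


# For contrast: a table built as the product of the same marginals,
# which is independent by construction (a hypothetical table, not from the text).
marg_x = {100: 0.5, 250: 0.5}
marg_y = {0: 0.25, 100: 0.25, 200: 0.5}
product_pmf = {(x, y): marg_x[x] * marg_y[y] for x, y in product(marg_x, marg_y)}

print("Example 5.1 table independent?", is_independent(joint_pmf))
print("Product table independent?", is_independent(product_pmf))
```

A single failing cell is enough to establish dependence, which is why Example 5.6 needs only the entry p(100, 100).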


Example 5.7 (Example 5.5 continued) Because f(x, y) in the nut scenario has the form of a product,
X and Y might appear to be independent. However, although f_X(3/4) = f_Y(3/4) = 9/16, f(3/4, 3/4) = 0 ≠ (9/16) · (9/16),
so the variables are not in fact independent. To be independent, f(x, y) must have the form g(x) · h(y) and
the region of positive density must be a rectangle whose sides are parallel to the coordinate axes. ■


Independence of two random variables is most useful when the description of the experiment 
under study tells us that X and Y have no effect on each other. Then once the marginal pmfs or pdfs 
have been specified, the joint pmf or pdf is simply the product of the two marginal functions. It 
follows that 


$$P(\{a \le X \le b\} \cap \{c \le Y \le d\}) = P(a \le X \le b) \cdot P(c \le Y \le d)$$

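Example 5.8 Suppose that the lifetimes of two components are independent of each other and that 
the first lifetime, $X_1$, has an exponential distribution with parameter $\lambda_1$, whereas the second, $X_2$, has an 
exponential distribution with parameter $\lambda_2$. Then the joint pdf is 

$$f(x_1, x_2) = f_{X_1}(x_1) \cdot f_{X_2}(x_2) = \lambda_1 e^{-\lambda_1 x_1} \cdot \lambda_2 e^{-\lambda_2 x_2} = \lambda_1 \lambda_2 e^{-\lambda_1 x_1 - \lambda_2 x_2} \qquad x_1 > 0,\ x_2 > 0$$

Let $\lambda_1 = 1/1000$ and $\lambda_2 = 1/1200$, so that the expected lifetimes are 1000 h and 1200 h, respectively. 
The probability that both component lifetimes are at least 1500 h is 

$$P(1500 \le X_1,\ 1500 \le X_2) = P(1500 \le X_1) \cdot P(1500 \le X_2) = e^{-\lambda_1(1500)} \cdot e^{-\lambda_2(1500)} = (.2231)(.2865) = .0639$$

The probability that the sum of their lifetimes, $X_1 + X_2$, is at most 3000 h requires a double integral of 
the joint pdf: 

$$P(X_1 + X_2 \le 3000) = P(X_1 \le 3000 - X_2) = \int_0^{3000} \int_0^{3000 - x_2} f(x_1, x_2)\, dx_1\, dx_2$$
$$= \int_0^{3000} \int_0^{3000 - x_2} \lambda_1 \lambda_2 e^{-\lambda_1 x_1 - \lambda_2 x_2}\, dx_1\, dx_2$$
$$= \int_0^{3000} \lambda_2 e^{-\lambda_2 x_2} \left[ -e^{-\lambda_1 x_1} \right]_0^{3000 - x_2} dx_2$$
$$= \int_0^{3000} \lambda_2 e^{-\lambda_2 x_2} \left[ 1 - e^{-\lambda_1 (3000 - x_2)} \right] dx_2$$
$$= \lambda_2 \int_0^{3000} \left[ e^{-\lambda_2 x_2} - e^{-3000\lambda_1} e^{(\lambda_1 - \lambda_2) x_2} \right] dx_2 = .7564 \quad ■$$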


More than Two Random Variables 
To model the joint behavior of more than two random variables, we extend the concept of a joint 
distribution of two variables. 


DEFINITION If $X_1, X_2, \ldots, X_n$ are all discrete random variables, the joint pmf of the variables 
is the function 

$$p(x_1, x_2, \ldots, x_n) = P(\{X_1 = x_1\} \cap \{X_2 = x_2\} \cap \cdots \cap \{X_n = x_n\})$$

If the variables are continuous, the joint pdf of $X_1, X_2, \ldots, X_n$ is the function 
$f(x_1, x_2, \ldots, x_n)$ such that for any n intervals $[a_1, b_1], \ldots, [a_n, b_n]$, 

$$P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\, dx_n \cdots dx_1$$

and more generally, for any n-dimensional set A, $P((X_1, \ldots, X_n) \in A)$ results 
from integrating f( ) over A. 


Example 5.9 A binomial experiment consists of n dichotomous (success-failure), homogeneous 
(constant success probability) independent trials. Now consider a trinomial experiment in which each 
of the n trials can result in one of three possible outcomes. For example, each successive customer at 
a store might pay with cash, a credit card, or a debit card. The trials are assumed independent. 
Let $p_1$ = P(trial results in a type 1 outcome), and define $p_2$ and $p_3$ analogously for type 2 and type 3 
outcomes. The random variables of interest here are $X_i$ = the number of trials that result in a type 
i outcome for i = 1, 2, 3. 

In n = 10 trials, the probability that the first five are type 1 outcomes, the next three are type 2, and 
the last two are type 3 (i.e., the probability of the experimental outcome 1111122233) is $p_1^5 \cdot p_2^3 \cdot p_3^2$. 
This is also the probability of the outcome 1122311123, and in fact the probability of any outcome 
that has exactly five 1’s, three 2’s, and two 3’s. Now to determine the probability $P(X_1 = 5, X_2 = 3, X_3 = 2)$, 
we have to count the number of outcomes that have exactly five 1’s, three 2’s, and two 
3’s. First, there are $\binom{10}{5}$ ways to choose five of the trials to be the type 1 outcomes. Now from the 
remaining five trials, we choose three to be the type 2 outcomes, which can be done in $\binom{5}{3}$ ways. 
This determines the remaining two trials, which consist of type 3 outcomes. So the total number of 
ways of choosing five 1’s, three 2’s, and two 3’s is 

$$\binom{10}{5} \binom{5}{3} = \frac{10!}{5!\,5!} \cdot \frac{5!}{3!\,2!} = \frac{10!}{5!\,3!\,2!} = 2520$$

Thus we see that $P(X_1 = 5, X_2 = 3, X_3 = 2) = 2520\, p_1^5 \cdot p_2^3 \cdot p_3^2$. Generalizing this to n trials gives 

$$p(x_1, x_2, x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}$$

for $x_1 = 0, 1, 2, \ldots$; $x_2 = 0, 1, 2, \ldots$; $x_3 = 0, 1, 2, \ldots$ such that $x_1 + x_2 + x_3 = n$. Notice that whereas 
there are three random variables here, the third variable $X_3$ is actually redundant, because for example 
in the case n = 10, having $X_1 = 5$ and $X_2 = 3$ implies that $X_3 = 2$ (just as in a binomial experiment 
there are actually two rvs, the number of successes and the number of failures, but the latter is 
redundant). 

As an example, the genetic allele of a pea section can be either AA, Aa, or aa. A simple genetic 
model specifies P(AA) = .25, P(Aa) = .50, and P(aa) = .25. If the alleles of ten independently 
obtained sections are determined, the probability that exactly five of these are Aa and two are AA is 

$$p(2, 5, 3) = \frac{10!}{2!\,5!\,3!} (.25)^2 (.50)^5 (.25)^3 = .0769 \quad ■$$
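The counting argument and the pea calculation of Example 5.9 can be reproduced directly. This is a minimal sketch using the standard library; `multinomial_coefficient` and `multinomial_pmf` are hypothetical helper names. The functions accept any number of categories, so they also cover the general multinomial case.

```python
from math import factorial, prod

def multinomial_coefficient(counts):
    """n! / (x1! x2! ... xr!), where n is the sum of the counts."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)   # each partial quotient is an integer
    return coef

def multinomial_pmf(counts, probs):
    """P(X1 = x1, ..., Xr = xr) for n = sum(counts) independent trials."""
    return multinomial_coefficient(counts) * prod(p ** x for p, x in zip(probs, counts))

print(multinomial_coefficient((5, 3, 2)))                          # 2520
print(round(multinomial_pmf((2, 5, 3), (0.25, 0.50, 0.25)), 4))    # 0.0769
```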


The trinomial scenario of Example 5.9 can be generalized by considering a multinomial experiment 
consisting of n independent and identical trials, in which each trial can result in any one of r possible 
outcomes. Let $p_i$ = P(outcome i on any particular trial), and define random variables by $X_i$ = the 
number of trials resulting in outcome i (i = 1, . . ., r). The joint pmf of $X_1, \ldots, X_r$ is called the 
multinomial distribution. An argument analogous to what was done in Example 5.9 gives the joint 
pmf of $X_1, \ldots, X_r$: 

$$p(x_1, \ldots, x_r) = \frac{n!}{x_1!\, x_2! \cdots x_r!}\, p_1^{x_1} \cdots p_r^{x_r} \qquad \text{for } x_i = 0, 1, 2, \ldots \text{ with } x_1 + \cdots + x_r = n$$

The case r = 2 reduces to the binomial distribution, with $X_1$ = number of successes and 
$X_2 = n - X_1$ = number of failures. Both the multinomial and binomial distributions model discrete 
rvs (counts). Now, let’s consider some examples with more than two continuous random variables. 


Example 5.10 When a certain method is used to collect a fixed volume of rock samples in a region, 
there are four resulting rock types. Let $X_1$, $X_2$, and $X_3$ denote the proportion by volume of rock types 
1, 2, and 3 in a randomly selected sample (the proportion of rock type 4 is $1 - X_1 - X_2 - X_3$, so a 
variable $X_4$ would be redundant). If the joint pdf of $X_1, X_2, X_3$ is 

$$f(x_1, x_2, x_3) = k x_1 x_2 (1 - x_3) \qquad 0 \le x_1 \le 1,\ 0 \le x_2 \le 1,\ 0 \le x_3 \le 1,\ x_1 + x_2 + x_3 \le 1$$

then k is determined by 

$$1 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2, x_3)\, dx_3\, dx_2\, dx_1 = \int_0^1 \left\{ \int_0^{1 - x_1} \left[ \int_0^{1 - x_1 - x_2} k x_1 x_2 (1 - x_3)\, dx_3 \right] dx_2 \right\} dx_1$$

This iterated integral has value k/144, so k = 144. The probability that rocks of types 1 and 2 together 
account for at most 50% of the sample is 

$$P(X_1 + X_2 \le .5) = \iiint_{\substack{0 \le x_i \le 1 \text{ for } i = 1, 2, 3 \\ x_1 + x_2 + x_3 \le 1,\ x_1 + x_2 \le .5}} f(x_1, x_2, x_3)\, dx_3\, dx_2\, dx_1$$
$$= \int_0^{.5} \left\{ \int_0^{.5 - x_1} \left[ \int_0^{1 - x_1 - x_2} 144 x_1 x_2 (1 - x_3)\, dx_3 \right] dx_2 \right\} dx_1 = \frac{5}{32} = .15625 \quad ■$$
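As a numerical cross-check of Example 5.10, the sketch below approximates the iterated integrals with a midpoint rule. The innermost integral over $x_3$ is done in closed form, $\int_0^s (1 - x_3)\,dx_3 = s - s^2/2$; the function name `iterated_integral` and the grid size are assumptions made for illustration.

```python
def iterated_integral(limit, n=200):
    """Approximate the integral of x1 * x2 * (1 - x3) over the region
    x1 + x2 <= limit, 0 <= x3 <= 1 - x1 - x2 (all coordinates nonnegative)."""
    total = 0.0
    h1 = limit / n
    for i in range(n):
        x1 = (i + 0.5) * h1
        h2 = (limit - x1) / n            # x2 runs from 0 to limit - x1
        for j in range(n):
            x2 = (j + 0.5) * h2
            s = 1.0 - x1 - x2            # upper limit for x3
            total += x1 * x2 * (s - s * s / 2.0) * h1 * h2
    return total

k = 1.0 / iterated_integral(1.0)          # normalizing constant, close to 144
p_half = 144.0 * iterated_integral(0.5)   # P(X1 + X2 <= .5), close to 5/32
print(round(k, 1), round(p_half, 4))
```

Taking `limit = 1.0` covers the whole region of positive density, which is why the same function yields the normalizing constant.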


The notion of independence of more than two random variables is similar to the notion of 
independence of more than two events. 


DEFINITION The random variables $X_1, X_2, \ldots, X_n$ are said to be independent if for every subset 
$X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ of the variables (each pair, each triple, and so on), the joint pmf or 
pdf of the subset is equal to the product of the marginal pmfs or pdfs. 


Thus if the variables are independent with n = 4, then the joint pmf or pdf of any two variables is the 
product of the two marginals, and similarly for any three variables and all four variables together. 
Most important, once we are told that n variables are independent, then the joint pmf or pdf is the 
product of the n marginals. 

Example 5.11 If $X_1, \ldots, X_n$ represent the lifetimes of n components, the components operate 
independently of each other, and each lifetime is exponentially distributed with parameter $\lambda$, then 

$$f(x_1, x_2, \ldots, x_n) = (\lambda e^{-\lambda x_1}) \cdot (\lambda e^{-\lambda x_2}) \cdots (\lambda e^{-\lambda x_n}) = \lambda^n e^{-\lambda \sum x_i} \qquad x_1 \ge 0,\ x_2 \ge 0, \ldots, x_n \ge 0$$

If these n components are connected in series, so that the system will fail as soon as a single 
component fails, then the probability that the system lasts past time t is 

$$P(X_1 > t, \ldots, X_n > t) = \int_t^{\infty} \cdots \int_t^{\infty} f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$$
$$= \left( \int_t^{\infty} \lambda e^{-\lambda x_1}\, dx_1 \right) \cdots \left( \int_t^{\infty} \lambda e^{-\lambda x_n}\, dx_n \right) = (e^{-\lambda t})^n = e^{-n\lambda t}$$

Therefore, 

$$P(\text{system lifetime} \le t) = 1 - e^{-n\lambda t} \qquad \text{for } t \ge 0$$

which shows that system lifetime has an exponential distribution with parameter $n\lambda$; the expected 
value of system lifetime is $1/(n\lambda)$. 

A variation on the foregoing scenario appeared in the article “A Method for Correlating Field Life 
Degradation with Reliability Prediction for Electronic Modules” (Quality and Reliability Engr. Intl. 
2005: 715-726). The investigators considered a circuit card with n soldered chip resistors. The failure 
time of a card is the minimum of the individual solder connection failure times (mileages here). It was 
assumed that the solder connection failure mileages were independent, that failure mileage would 
exceed t if and only if the shear strength of a connection exceeded a threshold d, and that each shear 
strength was normally distributed with a mean value and standard deviation that depended on the 
value of mileage t: $\mu(t) = a_1 - a_2 t$ and $\sigma(t) = a_3 + a_4 t$ (a weld’s shear strength typically deteriorates 
and becomes more variable as mileage increases). Then the probability that the failure mileage of a 
card exceeds t is 

$$P(T > t) = \left( 1 - \Phi\left( \frac{d - (a_1 - a_2 t)}{a_3 + a_4 t} \right) \right)^n$$

The cited article suggested values for d and the $a_i$’s based on data. In contrast to the exponential 
scenario, normality of individual lifetimes does not imply normality of system lifetime. ■
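The series-system result of Example 5.11 lends itself to a quick simulation check. The sketch below is illustrative only: the values n = 5 and $\lambda$ = 0.01, the seed, and the function name `simulate_series_lifetimes` are assumptions, not values from the text. It compares the simulated mean system lifetime with $1/(n\lambda)$ and the simulated survival fraction with $e^{-n\lambda t}$.

```python
import math
import random

def simulate_series_lifetimes(n_components, lam, n_systems, rng):
    """Each system fails at the minimum of its independent exponential lifetimes."""
    return [
        min(rng.expovariate(lam) for _ in range(n_components))
        for _ in range(n_systems)
    ]

rng = random.Random(11)      # fixed seed so the run is reproducible
n, lam = 5, 0.01             # hypothetical values chosen for illustration
lifetimes = simulate_series_lifetimes(n, lam, 20000, rng)

mean_lifetime = sum(lifetimes) / len(lifetimes)
t = 30.0
frac_surviving = sum(x > t for x in lifetimes) / len(lifetimes)

print(mean_lifetime, 1 / (n * lam))                 # simulated vs. theoretical mean
print(frac_surviving, math.exp(-n * lam * t))       # simulated vs. theoretical P(T > t)
```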


In many experimental situations to be considered in this book, independence is a reasonable 
assumption, so that specifying the joint distribution reduces to deciding on appropriate marginal 
distributions. 

Exercises: Section 5.1 (1-22) 

1. A service station has both self-service and full-service islands. On each island, there is a 
single regular unleaded pump with two hoses. Let X denote the number of hoses being used 
on the self-service island at a particular time, and let Y denote the number of hoses on the 
full-service island in use at that time. The joint pmf of X and Y appears in the accompanying 
tabulation. 

                  y
   p(x, y)    0     1     2
         0   .10   .04   .02
   x     1   .08   .20   .06
         2   .06   .14   .30

a. What is P(X = 1 and Y = 1)? 

b. Compute P(X ≤ 1 and Y ≤ 1). 

c. Give a word description of the event {X ≠ 0 and Y ≠ 0}, and compute the probability 
of this event. 

d. Compute the marginal pmf of X and of Y. Using $p_X(x)$, what is P(X ≤ 1)? 

e. Are X and Y independent rvs? Explain. 


2. A large but sparsely populated county has two small hospitals, one at the south end of 
the county and the other at the north end. The south hospital’s emergency room has 4 beds, 
whereas the north hospital’s emergency room has only 3 beds. Let X denote the number of 
south beds occupied at a particular time on a given day, and let Y denote the number of 
north beds occupied at the same time on the same day. Suppose that these two rvs are 
independent, that the pmf of X puts probability masses .1, .2, .3, .2, and .2 on the 
x values 0, 1, 2, 3, and 4, respectively, and that the pmf of Y distributes probabilities .1, 
.3, .4, and .2 on the y values 0, 1, 2, and 3, respectively. 

a. Display the joint pmf of X and Y in a joint probability table. 

b. Compute P(X ≤ 1 and Y ≤ 1) by adding probabilities from the joint pmf, and verify 
that this equals the product of P(X ≤ 1) and P(Y ≤ 1). 

c. Express the event that the total number of beds occupied at the two hospitals combined 
is at most 1 in terms of X and Y, and then calculate this probability. 

d. What is the probability that at least one of the two hospitals has no beds occupied? 


3. A market has both an express checkout line and a superexpress checkout line. Let $X_1$ 
denote the number of customers in line at the express checkout at a particular time of day, 
and let $X_2$ denote the number of customers in line at the superexpress checkout at the same 
time. Suppose the joint pmf of $X_1$ and $X_2$ is as given in the accompanying table. 

                     x2
              0     1     2     3
         0   .08   .07   .04   .00
         1   .06   .15   .05   .04
   x1    2   .05   .04   .10   .06
         3   .00   .03   .04   .07
         4   .00   .01   .05   .06

a. What is $P(X_1 = 1, X_2 = 1)$, that is, the probability that there is exactly one customer 
in each line? 

b. What is $P(X_1 = X_2)$, that is, the probability that the numbers of customers in the 
two lines are identical? 

c. Let A denote the event that there are at least two more customers in one line than 
in the other line. Express A in terms of $X_1$ and $X_2$, and calculate the probability of 
this event. 

d. What is the probability that the total number of customers in the two lines is 
exactly four? At least four? 

e. Determine the marginal pmf of $X_1$, and then calculate the expected number of 
customers in line at the express checkout. 

f. Determine the marginal pmf of $X_2$. 

g. By inspection of $P(X_1 = 4)$, $P(X_2 = 0)$, and $P(X_1 = 4, X_2 = 0)$, are $X_1$ and $X_2$ 
independent random variables? Explain your reasoning. 


4. Suppose 51% of the individuals in a certain 
population have brown eyes, 32% have blue 
eyes, and the remainder have green eyes. 
Consider a random sample of 10 people from 
this population. 


a. What is the probability that 5 of the 10 
people have brown eyes, 3 of 10 have blue 
eyes, and the other 2 have green eyes? 

b. What is the probability that exactly one 
person in the sample has blue eyes and 
exactly one has green eyes? 

c. What is the probability that at least 7 of the 
10 people have brown eyes? [Hint: Think 
of brown as a success and all other eye 
colors as failures. ] 


5. At a certain university, 20% of all students 
are freshmen, 18% are sophomores, 21% are 
juniors, and 41% are seniors. As part of a 
promotion, the university bookstore is run- 
ning a raffle for which all students are eligi- 
ble. Ten students will be randomly selected to 
receive prizes (in the form of textbooks for 
the term). 


a. What is the probability the winners consist 
of two freshmen, two sophomores, two 
juniors, and four seniors? 

b. What is the probability the winners are 
split equally among under-classmen 
(freshmen and sophomores) and upper- 
classmen (juniors and seniors)? 

c. The raffle resulted in no freshmen being 
selected. The freshman class president 
complained that something must be amiss 
for this to occur. Do you agree? Explain. 

6. According to the Mars Candy Company, the 
long-run percentages of various colors of 

M&M milk chocolate candies are as follows: 


Blue: 24%   Orange: 20%   Green: 16%   Yellow: 14%   Red: 13%   Brown: 13% 

a. In a random sample of 12 candies, what is 
the probability that there are exactly two of 
each color? 

b. In a random sample of 6 candies, what is 
the probability that at least one color is not 
included? 

c. In a random sample of 10 candies, what is 
the probability that there are exactly 3 blue 
candies and exactly 2 orange candies? 

d. In a random sample of 10 candies, what is 
the probability that there are at most 3 orange 
candies? [Hint: Think of an orange candy as 
a success and any other color as a failure. ] 

e. In a random sample of 10 candies, what is 
the probability that at least 7 are either 
blue, orange, or green? 


7. The number of customers waiting for gift-wrap 


service at a department store is an rv X with 
possible values 0, 1, 2, 3, 4 and corresponding 
probabilities .1, .2, .3, .25, .15. A randomly 
selected customer will have 1, 2, or 3 packages 
for wrapping with probabilities .6, .3, and .1, 
respectively. Let Y= the total number of 
packages to be wrapped for the customers 
waiting in line (assume that the number of 
packages submitted by one customer is inde- 
pendent of the number submitted by any other 
customer). 


a. Determine P(X =3, Y=3), that is, 


p(3, 3). 
b. Determine p(4, 11). 


8. Let X denote the number of Sony 65” 4K Ultra HD televisions sold during a particular 
week by a certain store. The pmf of X is 


x 0 1 2 3 4 
pix) LB SAS 


Sixty percent of all customers who purchase 
these TVs also buy an extended warranty. Let 
Y denote the number of purchasers during this 
week who buy an extended warranty. 


a. What is P(X = 4, Y = 2)? [Hint: This 
probability is $P(Y = 2 \mid X = 4) \cdot P(X = 4)$; 
now think of the four purchases as four 
trials of a binomial experiment, with 
success on a trial corresponding to 
buying an extended warranty.] 

b. Calculate P(X = Y). 

c. Determine the joint pmf of X and Y and 
then the marginal pmf of Y. 


9. The joint probability distribution of the number X of cars and the number Y of 
buses per signal cycle at a proposed left-turn lane is displayed in the accompanying 
joint probability table. 

                   y
   p(x, y)     0      1      2
         0   .025   .015   .010
         1   .050   .030   .020
   x     2   .125   .075   .050
         3   .150   .090   .060
         4   .100   .060   .040
         5   .050   .030   .020

a. What is the probability that there is exactly one car and exactly one bus 
during a cycle? 

b. What is the probability that there is at most one car and at most one bus during 
a cycle? 

c. What is the probability that there is exactly one car during a cycle? Exactly 
one bus? 

d. Suppose the left-turn lane is to have a capacity of five cars, and one bus is 
equivalent to three cars. What is the probability of an overflow during a cycle? 

e. Are X and Y independent rvs? Explain. 


10. A stockroom currently has 30 components of a certain type, of which 8 were provided 
by supplier 1, 10 by supplier 2, and 12 by supplier 3. Six of these are to be randomly 
selected for a particular assembly. Let X = the number of supplier 1’s components 
selected, Y = the number of supplier 2’s components selected, and p(x, y) denote the 
joint pmf of X and Y. 


a. What is p(3, 2)? [Hint: Each sample of 
size 6 is equally likely to be selected. 
Therefore, p(3, 2) = (number of outcomes 
with X = 3 and Y = 2)/(total number of 
outcomes). Now use the product rule for 
counting to obtain the numerator and 
denominator.] 

b. Using the logic of part (a), obtain 
p(x, y). (This can be thought of as a 
multivariate hypergeometric distribution: 
sampling without replacement from a 
finite population consisting of more than 
two categories.) 


11. Each front tire of a vehicle is supposed to 


be filled to a pressure of 26 psi. Suppose the 
actual air pressure in each tire is a random 
variable—X for the right tire and Y for the 
left tire, with joint pdf 


$$f(x, y) = k(x^2 + y^2) \qquad 20 \le x \le 30,\ 20 \le y \le 30$$


a. What is the value of k? 

b. What is the probability that both tires 
are under-filled? 

c. What is the probability that the differ- 
ence in air pressure between the two 
tires is at most 2 psi? 

d. Determine the (marginal) distribution of 
air pressure in the right tire alone. 

e. Are X and Y independent rvs? 


12. Annie and Alvie have agreed to meet 


between 5:00 p.m. and 6:00 p.m. for dinner 
at a local health-food restaurant. Let X = 
Annie’s arrival time and Y = Alvie’s arrival 
time. Suppose X and Y are independent 
with each uniformly distributed on the 
interval [5, 6]. 


a. What is the joint pdf of X and Y? 

b. What is the probability that they both 
arrive between 5:15 and 5:45? 

c. If the first one to arrive will wait only 
10 min before leaving to eat elsewhere, 
what is the probability that they have 
dinner at the health-food restaurant? 
[Hint: The event of interest is 
$A = \{(x, y) : |x - y| \le 1/6\}$.] 


13. Two different professors have just submitted final exams for duplication. Let X denote 
the number of typographical errors on the first professor’s exam and Y denote the 
number of such errors on the second exam. Suppose X has a Poisson distribution with 
parameter $\mu_1$, Y has a Poisson distribution with parameter $\mu_2$, and X and Y are 
independent. 

a. What is the joint pmf of X and Y? 

b. What is the probability that at most one error is made on both exams combined? 

c. Obtain a general expression for the probability that the total number of 
errors in the two exams is m (where m is a nonnegative integer). [Hint: A = {(x, y) : 
x + y = m} = {(m, 0), (m − 1, 1), ..., (1, m − 1), (0, m)}. Now sum the joint 
pmf over (x, y) ∈ A and use the binomial theorem, which says that 

$$\sum_{k=0}^{m} \binom{m}{k} a^k b^{m-k} = (a + b)^m$$

for any a, b.] 


14. Two components of a computer have the following joint pdf for their useful lifetimes 
X and Y: 

$$f(x, y) = x e^{-x(1 + y)} \qquad x \ge 0,\ y \ge 0$$

a. What is the probability that the lifetime X of the first component exceeds 3? 

b. What are the marginal pdfs of X and Y? Are the two lifetimes independent? 
Explain. 

c. What is the probability that the lifetime of at least one component exceeds 3? 


15. You have two lightbulbs for a particular lamp. Let X = the lifetime of the first bulb 
and Y = the lifetime of the second bulb (both in thousands of hours). Suppose that 
X and Y are independent and that each has 
an exponential distribution with parameter 
$\lambda = 1$. 

a. What is the joint pdf of X and Y? 

b. What is the probability that each bulb lasts at most 1000 h (i.e., X ≤ 1 and 
Y ≤ 1)? 

c. What is the probability that the total lifetime of the two bulbs is at most 2? 
[Hint: Draw a picture of the region A = {(x, y) : x ≥ 0, y ≥ 0, x + y ≤ 2} before 
integrating.] 

d. What is the probability that the total lifetime is between 1 and 2? 


16. Suppose that you have ten lightbulbs, that the lifetime of each is independent of all the 
other lifetimes, and that each lifetime has an exponential distribution with parameter $\lambda$. 

a. What is the probability that all ten bulbs fail before time t? 

b. What is the probability that exactly k of the ten bulbs fail before time t? 

c. Suppose that nine of the bulbs have lifetimes that are exponentially distributed 
with parameter $\lambda$ and that the remaining bulb has a lifetime that is 
exponentially distributed with parameter $\theta$ (it is made by another manufacturer). 
What is the probability that exactly five of the ten bulbs fail before time t? 


17. Consider a system consisting of three components as pictured. The system will 
continue to function as long as the first component functions and either component 
2 or component 3 functions. Let $X_1$, $X_2$, and $X_3$ denote the lifetimes of components 1, 2, 
and 3, respectively. Suppose the $X_i$’s are independent of each other and each $X_i$ 
has an exponential distribution with 
parameter $\lambda$. 

a. Let Y denote the system lifetime. Obtain the cumulative distribution function of 
Y and differentiate to obtain the pdf. [Hint: F(y) = P(Y ≤ y); express the 
event {Y ≤ y} in terms of unions and/or intersections of the three events 
$\{X_1 \le y\}$, $\{X_2 \le y\}$, and $\{X_3 \le y\}$.] 

b. Compute the expected system lifetime. 


18. a. For $f(x_1, x_2, x_3)$ as given in Example 5.10, compute the joint marginal density 
function of $X_1$ and $X_3$ alone (by integrating over $x_2$). 

b. What is the probability that rocks of types 1 and 3 together make up at most 50% of the 
sample? [Hint: Use the result of part (a).] 

c. Compute the marginal pdf of $X_1$ alone. [Hint: Use the result of part (a).] 


19. An ecologist selects a point inside a circular sampling region according to a uniform 
distribution. Let X = the x coordinate of the point selected and Y = the y coordinate of 
the point selected. If the circle is centered at (0, 0) and has radius r, then the joint pdf of 
X and Y is 

$$f(x, y) = \frac{1}{\pi r^2} \qquad x^2 + y^2 \le r^2$$

a. What is the probability that the selected point is within r/2 of the center of the 
circular region? [Hint: Draw a picture of the region of positive density 
D. Because f(x, y) is constant on D, computing a probability reduces to 
computing an area.] 

b. What is the probability that both X and Y differ from 0 by at most r/2? 

c. Answer part (b) for $r/\sqrt{2}$ replacing r/2. 

d. What is the marginal pdf of X? Of Y? Are X and Y independent? 


20. Each customer making a particular Internet purchase must pay with one of three types 
of credit cards (think Visa, MasterCard, Amex). Let $A_i$ (i = 1, 2, 3) be the event that 
a type i credit card is used, with $P(A_1) = .5$, 
$P(A_2) = .3$, $P(A_3) = .2$. Suppose that the number of customers who make a purchase 
on a given day, N, is a Poisson rv with parameter $\mu$. Define rvs $X_1, X_2, X_3$ by 
$X_i$ = the number among the N customers who use a type i card (i = 1, 2, 3). Show 
that these three rvs are independent with Poisson distributions having parameters 
$.5\mu$, $.3\mu$, and $.2\mu$, respectively. [Hint: For nonnegative integers $x_1, x_2, x_3$, let 
$n = x_1 + x_2 + x_3$, so $P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3, N = n)$. Now 
condition on N = n, in which case the three $X_i$’s have a trinomial distribution 
(multinomial with 3 categories) with category probabilities .5, .3, and .2.] 


21. Consider randomly selecting two points A and B on the circumference of a circle by 
selecting their angles of rotation, in degrees, independently from a uniform 
distribution on the interval [0, 360]. Connect points A and B with a straight line 
segment. What is the probability that this random chord is longer than the side of an 
equilateral triangle inscribed inside the circle? [Hint: Place one of the vertices of the 
inscribed triangle at A. You should then be able to intuit the answer visually without 
having to do any integration.] 

(This is called Bertrand’s Chord Problem in the probability literature. There are other 
ways of randomly selecting a chord that give different answers from the one 
appropriate here.) 


22. Consider the following technique, called the accept-reject method, for simulating 
values from a continuous distribution f. Identify a distribution g from which values 
can already be simulated and a constant c ≥ 1 such that f(x) ≤ c g(x) for all 
x. Proceed as follows: (1) Generate Y ~ g and, independently, U ~ Unif[0, 1). 
(2) If u ≤ f(y)/[c g(y)], then let x = y (i.e., “accept” the y value); otherwise, 
discard (“reject”) y. (3) Repeat steps (1)-(2) until the desired number of x values is 
obtained. 

a. Show that the probability a y value is “accepted” equals 1/c. [Hint: According 
to the algorithm, this occurs iff U ≤ f(Y)/[c g(Y)]. Compute the relevant double 
integral.] 

b. Argue that the average number of y values required to generate a single 
accepted x value is c. 

c. Show that the accept-reject method does result in observations from f by 
showing that P(accepted value ≤ x) = F(x), where F is the cdf corresponding 
to f. [Hint: Let X denote the accepted value. Then P(X ≤ x) = P(Y ≤ x | Y 
is accepted) = P(Y ≤ x ∩ Y is acc.)/P(Y is acc.).] 
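To make the three steps of the accept-reject method in Exercise 22 concrete, here is a minimal simulation sketch. It does not replace the proofs the exercise asks for. The target density f(x) = 6x(1 − x) on [0, 1], the candidate g = Unif[0, 1), the constant c = 1.5, and the seed are all illustrative assumptions, not choices made in the text.

```python
import random

def f(x):
    """Hypothetical target pdf: 6x(1 - x) on [0, 1], zero elsewhere."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

def g(x):
    """Candidate pdf: Unif[0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

C = 1.5   # f attains its maximum 1.5 at x = 0.5, so f(x) <= C * g(x) on [0, 1)

def accept_reject(n_values, rng):
    """Return n_values accepted draws and the number of proposals used."""
    accepted, n_proposals = [], 0
    while len(accepted) < n_values:
        y = rng.random()                 # step (1): Y ~ g
        u = rng.random()                 # step (1): U ~ Unif[0, 1), independent of Y
        n_proposals += 1
        if u <= f(y) / (C * g(y)):       # step (2): accept or reject y
            accepted.append(y)
    return accepted, n_proposals         # step (3): loop until enough values

rng = random.Random(2024)
xs, proposals = accept_reject(20000, rng)
acceptance_rate = len(xs) / proposals
sample_mean = sum(xs) / len(xs)
print(acceptance_rate, 1 / C)    # empirical acceptance rate vs. 1/c
print(sample_mean)               # the target density is symmetric about 0.5
```

The empirical acceptance rate should sit near 1/c, in line with part (a), and the number of proposals per accepted value should sit near c, in line with part (b).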


5.2 Expected Values, Covariance, and Correlation 


We previously saw that any function h(X) of a single rv X is itself a random variable. However, to 
compute E[h(X)], it was not necessary to obtain the probability distribution of h(X); instead, E[h(X)] 
was computed as a weighted average of h(X) values, where the weight function was the pmf p(x) or 
pdf f(x) of X. A similar result holds for a function h(X, Y) of two jointly distributed random variables. 


LAW OF THE UNCONSCIOUS STATISTICIAN Let X and Y be jointly distributed rvs with pmf p(x, y) 
or pdf f(x, y) according to whether the variables are discrete or continuous. Then the expected 
value of a function h(X, Y), denoted by E[h(X, Y)] or $\mu_{h(X,Y)}$, is given by 

$$E[h(X, Y)] = \begin{cases} \displaystyle\sum_x \sum_y h(x, y) \cdot p(x, y) & \text{if } X \text{ and } Y \text{ are discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) \cdot f(x, y)\, dx\, dy & \text{if } X \text{ and } Y \text{ are continuous} \end{cases} \tag{5.2}$$


The Law of the Unconscious Statistician generalizes to computing the expected value of a function 
$h(X_1, \ldots, X_n)$ of n random variables. If the $X_i$’s are discrete, $E[h(X_1, \ldots, X_n)]$ is an n-dimensional sum; 
if the $X_i$’s are continuous, it is an n-dimensional integral. 


Example 5.12 Five friends have purchased tickets to a concert. If the tickets are for seats 1-5 in a 
particular row and the tickets are randomly distributed among the five, what is the expected number 
of seats separating any particular two of the five? Let X and Y denote the seat numbers of the 
first and second individuals, respectively. Possible (X, Y) pairs are {(1,2), (1,3),...,(5,4)}, from 
which 

$$p(x, y) = .05 \qquad x = 1, \ldots, 5;\ y = 1, \ldots, 5;\ x \ne y$$

The number of seats separating the two individuals is h(X, Y) = |X − Y| − 1. The accompanying 
table gives h(x, y) for each possible (x, y) pair. 

                        x
   h(x, y)    1    2    3    4    5
         1    -    0    1    2    3
         2    0    -    0    1    2
   y     3    1    0    -    0    1
         4    2    1    0    -    0
         5    3    2    1    0    -

Thus 

$$E[h(X, Y)] = \sum_{(x, y)} h(x, y) \cdot p(x, y) = \sum_{x \ne y} (|x - y| - 1)(.05) = 1 \quad ■$$
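The expected value in Example 5.12 can be confirmed by enumerating the 20 equally likely seat pairs and applying (5.2) directly. This is a small sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# All ordered pairs (x, y) of distinct seat numbers from 1 to 5.
pairs = list(permutations(range(1, 6), 2))
p = Fraction(1, len(pairs))          # each pair has probability 1/20 = .05

# E[h(X, Y)] with h(x, y) = |x - y| - 1, summed over the joint pmf.
expected_gap = sum((abs(x - y) - 1) * p for x, y in pairs)
print(len(pairs), expected_gap)      # 20 1
```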


Example 5.13 In Example 5.5, the joint pdf of the amount X of almonds and amount Y of cashews in 
a 1-lb can of nuts was 

$$f(x, y) = 24xy \qquad 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1$$

If 1 lb of almonds costs the company $6.00, 1 lb of cashews costs $10.00, and 1 lb of peanuts costs 
$3.50, then the total cost of the contents of a can is 

$$h(X, Y) = 6X + 10Y + 3.5(1 - X - Y) = 3.5 + 2.5X + 6.5Y$$

(since 1 − X − Y of the weight consists of peanuts). The expected total cost is 

$$E[h(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) \cdot f(x, y)\, dx\, dy = \int_0^1 \int_0^{1 - x} (3.5 + 2.5x + 6.5y) \cdot 24xy\, dy\, dx = \$7.10 \quad ■$$


Properties of Expected Value 

In Chapters 3 and 4, we saw that expected values can be distributed across addition, subtraction, and 
multiplication by constants. In the language of mathematics, expected value is a linear operator. This 
was a simple consequence of expectation being a sum or an integral, both of which are linear. This 
obvious but important property, linearity of expectation, extends to more than one variable. 


LINEARITY OF EXPECTATION Let X and Y be random variables. Then, for any functions $h_1$, $h_2$ 
and any constants $a_1$, $a_2$, b, 

$$E[a_1 h_1(X, Y) + a_2 h_2(X, Y) + b] = a_1 E[h_1(X, Y)] + a_2 E[h_2(X, Y)] + b$$


In the previous example, E(3.5 + 2.5X + 6.5Y) can be rewritten as 3.5 + 2.5E(X) + 6.5E(Y); the 
means of X and Y can be computed either by using (5.2) or by first finding the marginal pdfs of X and 
Y and then performing the appropriate single integrals. 
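Both routes to the expected cost can be compared numerically. The sketch below approximates integrals against the joint pdf 24xy over the triangle x + y ≤ 1 with a midpoint rule; the function name `expect` and the grid size are illustrative assumptions. It computes E[h(X, Y)] directly from (5.2) and then again as 3.5 + 2.5E(X) + 6.5E(Y).

```python
def expect(h, n=400):
    """Midpoint-rule approximation of E[h(X, Y)] for the pdf f(x, y) = 24xy
    on the region 0 <= x, 0 <= y, x + y <= 1."""
    total = 0.0
    dx = 1.0 / n
    for i in range(n):
        x = (i + 0.5) * dx
        dy = (1.0 - x) / n               # y runs from 0 to 1 - x
        for j in range(n):
            y = (j + 0.5) * dy
            total += h(x, y) * 24.0 * x * y * dx * dy
    return total

direct = expect(lambda x, y: 3.5 + 2.5 * x + 6.5 * y)   # apply (5.2) to h itself
mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
via_linearity = 3.5 + 2.5 * mean_x + 6.5 * mean_y

print(round(direct, 2), round(via_linearity, 2))   # both close to 7.10
```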

As another illustration, linearity of expectation tells us that for any two rvs X and Y, 


$$E(5XY^2 - 4XY + e^X + 12) = 5E(XY^2) - 4E(XY) + E(e^X) + 12 \tag{5.3}$$

In general, we cannot distribute the expected value operation any further. But when h(X, Y) is a 
product of a function of X and a function of Y, the expected value simplifies in the case of 
independence. 


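PROPOSITION Let X and Y be independent random variables. If $h(X, Y) = g_1(X) \cdot g_2(Y)$, 
then 

$$E[h(X, Y)] = E[g_1(X) \cdot g_2(Y)] = E[g_1(X)] \cdot E[g_2(Y)],$$

assuming $E[g_1(X)]$ and $E[g_2(Y)]$ exist. 


Proof Consider two continuous rvs; the discrete case is similar. Apply (5.2): 

$$E[h(X, Y)] = E[g_1(X) \cdot g_2(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_1(x) \cdot g_2(y) \cdot f(x, y)\, dx\, dy \qquad \text{by (5.2)}$$
$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_1(x) \cdot g_2(y) \cdot f_X(x) \cdot f_Y(y)\, dx\, dy \qquad \text{because } X \text{ and } Y \text{ are independent}$$
$$= \left( \int_{-\infty}^{\infty} g_1(x) \cdot f_X(x)\, dx \right) \left( \int_{-\infty}^{\infty} g_2(y) \cdot f_Y(y)\, dy \right) = E[g_1(X)] \cdot E[g_2(Y)] \quad ■$$


So, if X and Y are independent, Expression (5.3) simplifies further, to $5E(X)E(Y^2) - 4E(X)E(Y) + 
E(e^X) + 12$. Not surprisingly, both linearity of expectation and the foregoing proposition can be 
extended to more than two random variables. 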
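The proposition can be checked exactly in the discrete case. The sketch below reuses the two independent hospital-bed pmfs described in Exercise 2 of Section 5.1 (reading the second probability for Y as .3) and the illustrative choice $g_1(x) = x$, $g_2(y) = y^2$, so the identity being verified is $E(XY^2) = E(X)E(Y^2)$. Variable names are assumptions for illustration.

```python
from fractions import Fraction as F

# Marginal pmfs of the independent rvs X and Y (hospital beds occupied).
p_x = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(2, 10), 4: F(2, 10)}
p_y = {0: F(1, 10), 1: F(3, 10), 2: F(4, 10), 3: F(2, 10)}

# Under independence the joint pmf is the product of the marginals.
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

def g1(x):
    return x

def g2(y):
    return y * y

lhs = sum(g1(x) * g2(y) * p for (x, y), p in joint.items())    # E[g1(X) g2(Y)]
e_g1 = sum(g1(x) * p for x, p in p_x.items())                  # E[g1(X)]
e_g2 = sum(g2(y) * p for y, p in p_y.items())                  # E[g2(Y)]

print(lhs, e_g1 * e_g2)   # the two exact fractions agree
```

If the joint pmf were replaced by one that is not a product of marginals, such as the insurance table of Example 5.14, the two sides would generally differ.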


Covariance 


When two random variables X and Y are not independent, it is frequently of interest to assess how 
strongly they are related to each other. 


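DEFINITION The covariance between two rvs X and Y is 

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \begin{cases} \displaystyle\sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & \text{if } X \text{ and } Y \text{ are discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & \text{if } X \text{ and } Y \text{ are continuous} \end{cases}$$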


The rationale for the definition is as follows. Suppose X and Y have a strong positive relationship to 
each other, by which we mean that large values of X tend to occur with large values of Y and small 
values of X with small values of Y (e.g., X = height and Y = weight). Then most of the probability 
mass or density will be associated with $(x - \mu_X)$ and $(y - \mu_Y)$ either both positive (both X and Y above 
their respective means) or both negative, so the product $(x - \mu_X)(y - \mu_Y)$ will tend to be positive. 
Thus for a strong positive relationship, Cov(X, Y) should be quite positive. For a strong negative 
relationship, the signs of $(x - \mu_X)$ and $(y - \mu_Y)$ will tend to be opposite, yielding a negative product. 
Thus for a strong negative relationship, Cov(X, Y) should be quite negative. If X and Y are not 
strongly related, positive and negative products will tend to cancel each other, yielding a covariance 
near 0. Figure 5.4 illustrates the different possibilities. The covariance depends on both the set of 
possible pairs and the probabilities. In Figure 5.4, the probabilities could be changed without altering 
the set of possible pairs, and this could drastically change the value of Cov(X, Y). 


Figure 5.4 p(x, y) = 1/10 for each of ten pairs corresponding to indicated points; (a) positive
covariance; (b) negative covariance; (c) covariance near zero


Example 5.14 The joint and marginal pmfs for X = automobile policy deductible amount and
Y = homeowner policy deductible amount in Example 5.1 were

                        y
    p(x, y)       0     100    200
    x    100     .20    .10    .20
         250     .05    .15    .30

    x          100    250            y          0     100    200
    p_X(x)     .5     .5             p_Y(y)    .25    .25    .50

from which $\mu_X = \sum x \cdot p_X(x) = 175$ and $\mu_Y = 125$. Therefore,

\[
\begin{aligned}
\mathrm{Cov}(X, Y) &= \sum_{(x, y)} (x - 175)(y - 125)\, p(x, y) \\
  &= (100 - 175)(0 - 125)(.20) + \cdots + (250 - 175)(200 - 125)(.30) \\
  &= 1875 \qquad \blacksquare
\end{aligned}
\]


The following proposition summarizes some important properties of covariance. 


PROPOSITION For any two random variables X and Y,

1. Cov(X, Y) = Cov(Y, X)

2. Cov(X, X) = V(X)

3. (Covariance shortcut formula) $\mathrm{Cov}(X, Y) = E(XY) - \mu_X \cdot \mu_Y$

4. (Distributive property of covariance) For any rv Z and any constants a, b, c,
Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z)


Proof Property 1 is obvious from the definition of covariance. To establish Property 2, replace Y with
X in the definition:

\[ \mathrm{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = V(X) \]

To prove Property 3, apply linearity of expectation:

\[
\begin{aligned}
\mathrm{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
  &= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
  &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
  &= E(XY) - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y = E(XY) - \mu_X \mu_Y
\end{aligned}
\]

Property 4 also follows from linearity of expectation (Exercise 39). ■


According to Property 3 (the covariance shortcut), no intermediate subtractions are necessary to
calculate covariance; only at the end of the computation is $\mu_X \cdot \mu_Y$ subtracted from E(XY).
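As a quick numerical check, the definition and the shortcut formula can be compared on the insurance pmf of Example 5.14. The following is a minimal sketch; the names `joint_pmf` and `expectation` are illustrative choices, not notation from the text.

```python
# Joint pmf of Example 5.14, keyed by (x, y) =
# (automobile deductible, homeowner deductible).
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def expectation(h, pmf):
    """E[h(X, Y)] for a discrete joint pmf given as {(x, y): probability}."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())


mu_x = expectation(lambda x, y: x, joint_pmf)
mu_y = expectation(lambda x, y: y, joint_pmf)

# Definition: Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
cov_definition = expectation(lambda x, y: (x - mu_x) * (y - mu_y), joint_pmf)

# Shortcut: Cov(X, Y) = E(XY) - mu_X * mu_Y (one subtraction, at the end)
cov_shortcut = expectation(lambda x, y: x * y, joint_pmf) - mu_x * mu_y

print(mu_x, mu_y, cov_definition, cov_shortcut)
```

Both routes should agree with the value 1875 obtained in Example 5.14, up to floating-point rounding.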


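Example 5.15 (Example 5.5 continued) The joint and marginal pdfs of X = amount of almonds and
Y = amount of cashews were

\[ f(x, y) = 24xy \qquad 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \]

\[ f_X(x) = 12x(1 - x)^2 \quad 0 \le x \le 1 \qquad\qquad f_Y(y) = 12y(1 - y)^2 \quad 0 \le y \le 1 \]

It is easily verified that $\mu_X = \mu_Y = \frac{2}{5}$, and

\[
\begin{aligned}
E(XY) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x, y)\,dx\,dy
       = \int_0^1 \int_0^{1-x} xy \cdot 24xy\,dy\,dx \\
      &= 8\int_0^1 x^2 (1 - x)^3\,dx = \frac{2}{15}
\end{aligned}
\]

Thus $\mathrm{Cov}(X, Y) = \frac{2}{15} - \left(\frac{2}{5}\right)\left(\frac{2}{5}\right)
= \frac{2}{15} - \frac{4}{25} = -\frac{2}{75}$. A negative covariance is reasonable here
because more almonds in the can imply fewer cashews. ■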
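The moments in Example 5.15 can be double-checked exactly. The sketch below relies on the standard Dirichlet integral, $\iint_{x, y \ge 0,\ x + y \le 1} x^a y^b\,dx\,dy = a!\,b!/(a + b + 2)!$ for nonnegative integers a and b; that formula and the helper names are additions for illustration, not part of the text.

```python
from fractions import Fraction
from math import factorial


def triangle_moment(a, b):
    """Exact integral of x**a * y**b over {x >= 0, y >= 0, x + y <= 1}.

    Uses the Dirichlet integral a! b! / (a + b + 2)! (nonnegative integers).
    """
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 2))


def expected_monomial(i, j):
    """E[X**i * Y**j] when the joint pdf is f(x, y) = 24xy on the triangle."""
    return 24 * triangle_moment(i + 1, j + 1)


total_probability = expected_monomial(0, 0)  # the pdf should integrate to 1
mu_x = expected_monomial(1, 0)
mu_y = expected_monomial(0, 1)
e_xy = expected_monomial(1, 1)
cov_xy = e_xy - mu_x * mu_y                  # covariance shortcut formula

print(total_probability, mu_x, mu_y, e_xy, cov_xy)
```

Working with `Fraction` avoids rounding, so the results can be compared directly with 2/5, 2/15, and -2/75.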


Correlation 

It would appear that the relationship in the insurance example is quite strong since Cov(X, Y) = 1875,
whereas in the nut example Cov(X, Y) = -2/75 would seem to imply quite a weak relationship.
Unfortunately, the covariance has a serious defect that makes it impossible to interpret a
computed value of the covariance. In the insurance example, suppose we had expressed the
deductible amount in cents rather than in dollars. Then 100X would replace X, 100Y would replace Y,
and the resulting covariance would be Cov(100X, 100Y) = (100)(100)Cov(X, Y) = 18,750,000. [To see
why, apply properties 1 and 4 of the previous proposition.] If, on the other hand, the deductible
amount had been expressed in hundreds of dollars, the computed covariance would have changed to
(.01)(.01)(1875) = .1875. The defect of covariance is that its computed value depends critically on
the units of measurement. Ideally, the choice of units should have no effect on a measure of strength
of relationship. This is achieved by scaling the covariance.


DEFINITION The correlation coefficient of X and Y, denoted by Corr(X, Y), or $\rho_{X,Y}$, or just
$\rho$, is defined by

\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]

Example 5.16 It is easily verified that in the insurance scenario of Example 5.14,
$E(X^2) = 36{,}250$, $\sigma_X^2 = 36{,}250 - (175)^2 = 5625$, $\sigma_X = 75$, $E(Y^2) = 22{,}500$,
$\sigma_Y^2 = 6875$, and $\sigma_Y = 82.92$. This gives

\[ \rho = \frac{1875}{(75)(82.92)} = .301 \qquad \blacksquare \]
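The effect of rescaling can be seen numerically on the insurance pmf. The sketch below (with an illustrative helper `cov_and_corr`, not a function from the text) computes the covariance and correlation with deductibles in dollars and again in cents.

```python
from math import sqrt

# Joint pmf of Example 5.14, keyed by (x, y) in dollars.
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def cov_and_corr(pmf, scale_x=1.0, scale_y=1.0):
    """Return (Cov, Corr) of (scale_x * X, scale_y * Y) for a discrete joint pmf."""
    points = [(scale_x * x, scale_y * y, p) for (x, y), p in pmf.items()]
    mu_x = sum(x * p for x, _, p in points)
    mu_y = sum(y * p for _, y, p in points)
    var_x = sum((x - mu_x) ** 2 * p for x, _, p in points)
    var_y = sum((y - mu_y) ** 2 * p for _, y, p in points)
    cov = sum((x - mu_x) * (y - mu_y) * p for x, y, p in points)
    return cov, cov / sqrt(var_x * var_y)


cov_dollars, rho_dollars = cov_and_corr(joint_pmf)
cov_cents, rho_cents = cov_and_corr(joint_pmf, scale_x=100, scale_y=100)

print(cov_dollars, rho_dollars)
print(cov_cents, rho_cents)
```

The covariance changes by a factor of 10,000 between the two unit choices, while the correlation stays at about .30.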
The following proposition shows that $\rho$ remedies the defect of Cov(X, Y) and also suggests how to
recognize the existence of a strong (linear) relationship.


PROPOSITION For any two rvs X and Y, 


1. Corr(X, Y) = Corr(Y, X) 

2. Corr(X, X) = 1 

3. (Scale invariance property) If a, b, c, d are constants and ac > 0,
Corr(aX + b, cY + d) = Corr(X, Y)

4. $-1 \le \mathrm{Corr}(X, Y) \le 1$.


Proof Property 1 is clear from the definition of correlation and the corresponding property of
covariance. To see why Property 2 is true, write
$\mathrm{Corr}(X, X) = \mathrm{Cov}(X, X)/[\sigma_X \cdot \sigma_X] = V(X)/\sigma_X^2 = 1$. The
second-to-last step uses Property 2 of covariance. The proofs of Properties 3 and 4 appear as
exercises. ■


Property 3 (scale invariance) says precisely that the correlation coefficient is not affected by a 
linear change in the units of measurement. If, say, Y = completion time for a chemical reaction in 
seconds and X = temperature in °C, then Y/60 = time in minutes and 1.8X + 32 = temperature in °F, 
but Corr(X, Y) will be exactly the same as Corr(1.8X + 32, Y/60). 

According to Properties 2 and 4, the strongest possible positive relationship is evidenced by
$\rho = +1$ whereas the strongest possible negative relationship corresponds to $\rho = -1$.
Therefore, the correlation coefficient provides information about both the nature and strength of the
relationship between X and Y: The sign of $\rho$ indicates whether X and Y are positively or negatively
related, and the magnitude of $\rho$ describes the strength of that relationship on an absolute 0 to 1
scale.

If we think of p(x, y) or f(x, y) as prescribing a mathematical model for how the two numerical
variables X and Y are distributed in some population (height and weight, verbal SAT score and
quantitative SAT score, etc.), then $\rho$ is a population characteristic or parameter that measures
how strongly X and Y are related in the population. In Chapter 12, we will consider taking a sample of
pairs $(x_1, y_1), \ldots, (x_n, y_n)$ from the population. The sample correlation coefficient r will
then be defined and used to make inferences about $\rho$.

While superior to covariance, the correlation coefficient $\rho$ is actually not a completely general
measure of the strength of a relationship.


PROPOSITION 1. If X and Y are independent, then $\rho = 0$, but $\rho = 0$ does not imply
independence.
2. $\rho = 1$ or $-1$ iff Y = aX + b for some numbers a and b with $a \ne 0$.


Exercise 38 and Example 5.17 relate to Statement 1, and Statement 2 is investigated in Exercises 41
and 42(d).

This proposition says that $\rho$ is a measure of the degree of linear relationship between X and Y,
and only when the two variables are perfectly related in a linear manner will $\rho$ be as positive or
negative as it can be. A $\rho$ less than 1 in absolute value indicates only that the relationship is
not completely linear, but there may still be a very strong nonlinear relation. Also, $\rho = 0$ does
not imply that X and Y are independent, but only that there is complete absence of a linear
relationship. When $\rho = 0$, X and Y are said to be uncorrelated. Two variables could be
uncorrelated yet highly dependent because of a strong nonlinear relationship, so be careful not to
conclude too much from knowing that $\rho = 0$.
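A small discrete illustration of "uncorrelated yet dependent" can be checked exactly. The setup is an assumption chosen for simplicity and is not an example from the text: X is uniform on {-1, 0, 1} and Y = X², so Y is completely determined by X.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Hypothetical joint pmf: X uniform on {-1, 0, 1} and Y = X**2.
joint_pmf = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expectation(h):
    """E[h(X, Y)] under joint_pmf."""
    return sum(h(x, y) * p for (x, y), p in joint_pmf.items())


mu_x = expectation(lambda x, y: x)
mu_y = expectation(lambda x, y: y)
cov_xy = expectation(lambda x, y: x * y) - mu_x * mu_y

# Independence would require p(0, 0) = p_X(0) * p_Y(0).
p_x0 = sum(p for (x, _), p in joint_pmf.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint_pmf.items() if y == 0)
p_00 = joint_pmf[(0, 0)]

print(cov_xy, p_00, p_x0 * p_y0)
```

Here the covariance (and hence the correlation) is exactly 0, yet p(0, 0) = 1/3 differs from the product of the marginals, 1/9, so X and Y are not independent.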


Example 5.17 In the manufacture of metal disks, small divots sometimes occur on the surface. If we
represent the disk surface by the region $x^2 + y^2 \le r^2$, one possible joint density function for
the location (X, Y) of a divot is

\[ f(x, y) = \frac{3}{2\pi r^3}\sqrt{x^2 + y^2} \qquad x^2 + y^2 \le r^2 \]

(This model reflects the fact that it's more likely to see blemishes closer to the disk's edge, since
that's where cutting has occurred.)

Since f(x, y) is an even function of x and y, simple symmetry arguments show that E(X) = 0,
E(Y) = 0, and E(XY) = 0, from which $\rho_{X,Y} = 0$. So, by definition X and Y are uncorrelated.

However, X and Y are clearly not independent. For instance, if X = 0 (so the divot is on the
midline), then Y can range from -r to r; however, if $X \approx r$ (divot near the "right" edge), then
Y must necessarily be close to 0.

You could also verify that X and Y are not independent by determining their marginal distributions
and observing that $f(x, y) \ne f_X(x) \cdot f_Y(y)$, but the marginal pdfs are tedious here. ■

The next result provides an alternative view of zero correlation. 


PROPOSITION Two rvs X and Y are uncorrelated if, and only if, $E[XY] = \mu_X \cdot \mu_Y$.


Proof By its definition, Corr(X, Y) = 0 iff Cov(X, Y) = 0. Apply the covariance shortcut formula:

\[ \mathrm{Cov}(X, Y) = 0 \iff E(XY) - \mu_X \cdot \mu_Y = 0 \iff E(XY) = \mu_X \cdot \mu_Y \qquad \blacksquare \]

Contrast this with an earlier proposition from this section: If X and Y are independent, then
$E[g_1(X) g_2(Y)] = E[g_1(X)] \cdot E[g_2(Y)]$ for all functions $g_1$ and $g_2$. Thus, independence
is stronger than zero correlation, the latter just being the special case corresponding to
$g_1(X) = X$ and $g_2(Y) = Y$.


Correlation Versus Causation 

A value of $\rho$ near 1 does not necessarily imply that increasing the value of X causes Y to
increase. It implies only that large X values are associated with large Y values. For example, in the
population of children, vocabulary size and number of cavities are quite positively correlated, but it
is certainly not true that cavities cause vocabulary to grow. Instead, the values of both these
variables tend to increase as the value of age, a third variable, increases. For children of a fixed
age, there is probably a very low correlation between number of cavities and vocabulary size. In
summary, association (a high correlation) is not the same as causation.


Exercises: Section 5.2 (23-42) 

23. The two most common types of errors made by programmers are syntax errors and logic errors.
Let X denote the number of syntax errors and Y the number of logic errors on the first run of a
program. Suppose X and Y have the following joint pmf for a particular programming assignment:

                        x
    p(x, y)      0      1      2      3
    y    0      .71    .03    .02    [?]
         1      .04    .06    [?]    .01
         2      .03    .03    .02    .01

a. What is the probability a program has more syntax errors than logic errors on the first run?
b. Find the marginal pmfs of X and Y.
c. Are X and Y independent? How can you tell?
d. What is the average number of syntax errors in the first run of a program? What is the average
number of logic errors?
e. Suppose an evaluator assigns points to each program with the formula 100 - 4X - 9Y. What is the
expected point score for a randomly selected program?

24. An instructor has given a short quiz consisting of two parts. For a randomly selected student,
let X = the number of points earned on the first part and Y = the number of points earned on the
second part. Suppose that the joint pmf of X and Y is given in the accompanying table.

                        y
    p(x, y)      0      5     10     15
    x    0      .02    .06    .02    .10
         5      .04    .15    .20    .10
        10      .01    .15    .14    .01

a. If the score recorded in the grade book is the total number of points earned on the two parts,
what is the expected recorded score E(X + Y)?
b. If the maximum of the two scores is recorded, what is the expected recorded score?

25. The difference between the number of customers in line at the express checkout and the number
in line at the superexpress checkout in Exercise 3 is $X_1 - X_2$. Calculate the expected difference.

26. Six individuals, including A and B, take seats around a circular table in a completely random
fashion. Suppose the seats are numbered 1, ..., 6. Let X = A's seat number and Y = B's seat number.
If A sends a written message around the table to B in the direction in which they are closest, how
many individuals (including A and B) would you expect to handle the message?


27. A surveyor wishes to lay out a square region with each side having length L. However, because of
measurement error, he instead lays out a rectangle in which the north-south sides both have length X
and the east-west sides both have length Y. Suppose that X and Y are independent and that each one is
uniformly distributed on the interval [L - A, L + A] (where 0 < A < L). What is the expected area of
the resulting rectangle?

28. Consider a small ferry that can accommodate cars and buses. The toll for cars is $3, and the toll
for buses is $10. Let X and Y denote the number of cars and buses, respectively, carried on a single
trip. Suppose the joint distribution of X and Y is as given in the table of Exercise 9. Compute the
expected revenue from a single trip.

29. Annie and Alvie have agreed to meet for lunch between noon (0:00 p.m.) and 1:00 p.m. Denote
Annie's arrival time by X, Alvie's by Y, and suppose X and Y are independent with pdfs

\[ f_X(x) = 3x^2 \quad 0 \le x \le 1 \qquad\qquad f_Y(y) = 2y \quad 0 \le y \le 1 \]

What is the expected amount of time that the one who arrives first must wait for the other person?
[Hint: h(X, Y) = |X - Y|.]

30. Suppose that X and Y are independent rvs with moment generating functions $M_X(t)$ and $M_Y(t)$,
respectively. If Z = X + Y, show that $M_Z(t) = M_X(t) \cdot M_Y(t)$. [Hint: Use the proposition on
the expected value of a product.]

31. Compute the correlation coefficient $\rho$ for X and Y of Example 5.15 (the covariance has
already been computed).

32. a. Compute the covariance for X and Y in Exercise 24.
b. Compute $\rho$ for X and Y in the same exercise.

33. Compute Cov(X, Y) and $\rho$ for the variables in Exercise 11.

34. Reconsider the computer component lifetimes X and Y as described in Exercise 14. Determine
E(XY). What can be said about Cov(X, Y) and $\rho$?

35. Referring back to Exercise 23, calculate both Cov(X, Y) and $\rho$.

36. In practice, it is often desired to predict the value of a variable Y from the known value of
some other variable, X. For example, a doctor might wish to predict the lifespan Y of someone who
smokes X cigarettes a day, or an engineer may require predictions of the tensile strength Y of steel
made with concentration X of a certain additive. A linear predictor of Y is anything of the form
$\hat{Y} = a + bX$; the "hat" on Y indicates prediction. A common measure of the quality of a
predictor is given by the mean square prediction error, $E[(Y - \hat{Y})^2]$.
a. Show that the choices of a and b that minimize mean square prediction error are

\[ b = \rho \cdot \frac{\sigma_Y}{\sigma_X} \qquad\qquad a = \mu_Y - b \cdot \mu_X \]

where $\rho = \mathrm{Corr}(X, Y)$. The resulting expression for $\hat{Y}$ is often called the best
linear predictor of Y, given X. [Hint: Expand the expression for mean square prediction error, apply
linearity of expectation, and then use calculus.]
b. Determine the mean square prediction error for the best linear predictor. How does the value of
$\rho$ affect this quantity?

37. Recalling the definition of $\sigma^2$ for a single rv X, write a formula that would be
appropriate for computing the variance of a function h(X, Y) of two random variables. [Hint: Remember
that variance is just a special expected value.] Then use this formula to compute the variance of the
recorded score h(X, Y) [= max(X, Y)] in part (b) of Exercise 24.

38. Show that when X and Y are independent, Cov(X, Y) = Corr(X, Y) = 0.

39. Use linearity of expectation to establish the covariance property

Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z)

40. a. Use the properties of covariance to show that Cov(aX + b, cY + d) = acCov(X, Y).
b. Use part (a) along with the rescaling properties of standard deviation to show that
Corr(aX + b, cY + d) = Corr(X, Y) when ac > 0 (this is the scale invariance property of correlation).
c. What happens if a and c have opposite signs, so ac < 0?

41. Verify that if Y = aX + b ($a \ne 0$), then Corr(X, Y) = +1 or -1. Under what conditions will
$\rho = +1$?

42. Consider the standardized variables $Z_X = (X - \mu_X)/\sigma_X$ and $Z_Y = (Y - \mu_Y)/\sigma_Y$,
and let $\rho = \mathrm{Corr}(X, Y)$.
a. Use properties of covariance and correlation to verify that
$\mathrm{Corr}(X, Y) = \mathrm{Cov}(Z_X, Z_Y) = E(Z_X Z_Y)$.
b. Use linearity of expectation along with part (a) to show that
$E[(Z_Y - \rho Z_X)^2] = 1 - \rho^2$. [Hint: If Z is a standardized rv, what are its mean and
variance, and how can you use those to determine $E(Z^2)$?]
c. Use part (b) to show that $-1 \le \rho \le 1$.
d. Use part (b) to show that $\rho = 1$ implies that Y = aX + b where a > 0, and $\rho = -1$ implies
that Y = aX + b where a < 0.


5.3. Linear Combinations 


A linear combination of random variables refers to anything of the form $a_1X_1 + \cdots + a_nX_n + b$,
where the $X_i$'s are random variables and the $a_i$'s and b are numerical constants. (Some sources do
not include the constant b in the definition.) For example, suppose your investment portfolio with a
particular financial institution includes 100 shares of stock #1, 200 shares of stock #2, and 500
shares of stock #3. Let $X_1$, $X_2$, and $X_3$ denote the share prices of these three stocks at the
end of the current fiscal year. Suppose also that the financial institution will levy a management fee
of $150. Then the value of your investments with this institution at the end of the year is
$100X_1 + 200X_2 + 500X_3 - 150$, which is a particular linear combination. Important special cases
include the total $X_1 + \cdots + X_n$ (take $a_1 = \cdots = a_n = 1$, b = 0), the difference of two
rvs $X_1 - X_2$ (n = 2, $a_1 = 1$, $a_2 = -1$, b = 0), and anything of the form aX + b (take n = 1 or,
equivalently, set $a_2 = \cdots = a_n = 0$). Another very important linear combination is the sample
mean $\bar{X} = (X_1 + \cdots + X_n)/n$; just take $a_1 = \cdots = a_n = 1/n$ and b = 0.

Notice that we are not requiring the $X_i$'s to be independent or to have the same probability
distribution. All the $X_i$'s could have different distributions and therefore different mean values
and standard deviations. In this section, we investigate the general properties of linear combinations.
Section 6.2 will explore some special properties of the total and sample mean under additional
assumptions.

We first consider the expected value and variance of a linear combination.


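THEOREM Let the rvs $X_1, X_2, \ldots, X_n$ have mean values $\mu_1, \ldots, \mu_n$ and standard
deviations $\sigma_1, \ldots, \sigma_n$, respectively.


1. Whether or not the $X_i$'s are independent,

\[
E(a_1X_1 + \cdots + a_nX_n + b) = a_1E(X_1) + \cdots + a_nE(X_n) + b
  = a_1\mu_1 + \cdots + a_n\mu_n + b \tag{5.4}
\]

and

\[
V(a_1X_1 + \cdots + a_nX_n + b) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \mathrm{Cov}(X_i, X_j)
  = \sum_{i=1}^{n} a_i^2\sigma_i^2 + 2\mathop{\sum\sum}_{i<j} a_i a_j \mathrm{Cov}(X_i, X_j) \tag{5.5}
\]

2. If $X_1, \ldots, X_n$ are independent,

\[
V(a_1X_1 + \cdots + a_nX_n + b) = a_1^2V(X_1) + \cdots + a_n^2V(X_n)
  = a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2 \tag{5.6}
\]

and

\[ \sigma_{a_1X_1 + \cdots + a_nX_n + b} = \sqrt{a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2} \]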


A paraphrase of (5.4) is that the expected value of a linear combination is the same linear combination
of the expected values; for example, $E(2X_1 + 5X_2) = 2\mu_1 + 5\mu_2$. The result (5.6) in Statement 2
is a special case of (5.5) in Statement 1: When the $X_i$'s are independent,
$\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$ (this simplification actually occurs when the $X_i$'s are
uncorrelated, a weaker condition than independence).


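Proof (n = 2) To establish (5.4), we could invoke linearity of expectation from Section 5.2, but we
present a direct proof here. Suppose that $X_1$ and $X_2$ are continuous with joint pdf
$f(x_1, x_2)$. Then

\[
\begin{aligned}
E(a_1X_1 + a_2X_2 + b)
  &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (a_1x_1 + a_2x_2 + b)\, f(x_1, x_2)\,dx_1\,dx_2 \\
  &= a_1\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 f(x_1, x_2)\,dx_2\,dx_1
   + a_2\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_2 f(x_1, x_2)\,dx_1\,dx_2
   + b\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 \\
  &= a_1\int_{-\infty}^{\infty} x_1\left[\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2\right]dx_1
   + a_2\int_{-\infty}^{\infty} x_2\left[\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\right]dx_2 + b(1) \\
  &= a_1\int_{-\infty}^{\infty} x_1 f_{X_1}(x_1)\,dx_1
   + a_2\int_{-\infty}^{\infty} x_2 f_{X_2}(x_2)\,dx_2 + b \\
  &= a_1E(X_1) + a_2E(X_2) + b
\end{aligned}
\]

Summation replaces integration in the discrete case. The argument for (5.5) does not require
specifying whether either variable is discrete or continuous. Recalling that
$V(Y) = E[(Y - \mu_Y)^2]$,

\[
\begin{aligned}
V(a_1X_1 + a_2X_2 + b)
  &= E\{[a_1X_1 + a_2X_2 + b - (a_1\mu_1 + a_2\mu_2 + b)]^2\} \\
  &= E\{[a_1X_1 - a_1\mu_1 + a_2X_2 - a_2\mu_2]^2\} \\
  &= E[a_1^2(X_1 - \mu_1)^2 + a_2^2(X_2 - \mu_2)^2 + 2a_1a_2(X_1 - \mu_1)(X_2 - \mu_2)] \\
  &= a_1^2E[(X_1 - \mu_1)^2] + a_2^2E[(X_2 - \mu_2)^2] + 2a_1a_2E[(X_1 - \mu_1)(X_2 - \mu_2)]
\end{aligned}
\]

where the last equality comes from linearity of expectation. We recognize the terms in this last
expression as variances and covariance, all together
$a_1^2V(X_1) + a_2^2V(X_2) + 2a_1a_2\mathrm{Cov}(X_1, X_2)$, as required. ■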


Example 5.18 A gas station sells three grades of gasoline: regular, plus, and premium. These are
priced at $3.50, $3.65, and $3.80 per gallon, respectively. Let $X_1$, $X_2$, and $X_3$ denote the
amounts of these grades purchased (gallons) on a particular day. Suppose the $X_i$'s are independent
with $\mu_1 = 1000$, $\mu_2 = 500$, $\mu_3 = 300$, $\sigma_1 = 100$, $\sigma_2 = 80$, and
$\sigma_3 = 50$. The revenue from sales is $Y = 3.5X_1 + 3.65X_2 + 3.8X_3$, and

\[
\begin{aligned}
E(Y) &= 3.5\mu_1 + 3.65\mu_2 + 3.8\mu_3 = \$6465 \\
V(Y) &= 3.5^2\sigma_1^2 + 3.65^2\sigma_2^2 + 3.8^2\sigma_3^2 = 243{,}864 \\
\sigma_Y &= \sqrt{243{,}864} = \$493.83 \qquad \blacksquare
\end{aligned}
\]
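Formulas (5.4) and (5.5) translate directly into a few lines of code. The sketch below uses an illustrative helper, `linear_combination_moments`, which takes a full covariance matrix so that the same function also covers dependent variables; for Example 5.18 the off-diagonal covariances are 0 because the $X_i$'s are independent.

```python
from math import sqrt


def linear_combination_moments(coeffs, means, cov_matrix, constant=0.0):
    """Mean and variance of a_1 X_1 + ... + a_n X_n + b via (5.4) and (5.5).

    cov_matrix[i][j] holds Cov(X_i, X_j); its diagonal holds the variances.
    """
    n = len(coeffs)
    mean = constant + sum(a * m for a, m in zip(coeffs, means))
    variance = sum(
        coeffs[i] * coeffs[j] * cov_matrix[i][j]
        for i in range(n)
        for j in range(n)
    )
    return mean, variance


# Example 5.18: prices per gallon, mean gallons sold, and standard deviations.
prices = [3.50, 3.65, 3.80]
means = [1000, 500, 300]
sds = [100, 80, 50]

# Independence: variances on the diagonal, zero covariances elsewhere.
cov_matrix = [[sds[i] ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]

mean_y, var_y = linear_combination_moments(prices, means, cov_matrix)
sd_y = sqrt(var_y)
print(mean_y, var_y, round(sd_y, 2))
```

Replacing the zero off-diagonal entries with nonzero covariances would show how dependence between the grades changes V(Y) while leaving E(Y) untouched.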


Example 5.19 Recall that a hypergeometric rv X is the number of successes in a random sample
of size n selected without replacement from a population of size N consisting of M successes and
N - M failures. It is tricky to obtain the mean value and variance of X directly from the pmf, and the
hypergeometric moment generating function is very complicated. We now show how the foregoing
proposition on linear combinations can be used to accomplish this task.

To this end, let $X_1 = 1$ if the first individual or object selected is a success and $X_1 = 0$ if
it is a failure; define $X_2, X_3, \ldots, X_n$ analogously for the second selection, third selection,
and so on. Each $X_i$ is a Bernoulli rv, and each has the same marginal distribution: p(1) = M/N and
p(0) = 1 - M/N (this is obvious for $X_1$, which is based on the very first draw from the population,
and can be verified for the other draws as well). Thus $E(X_i) = 0(1 - M/N) + 1(M/N) = M/N$. The total
number of successes in the sample is $X = X_1 + \cdots + X_n$ (a 1 is added in for each success and a
0 for each failure), so

\[ E(X) = E(X_1) + \cdots + E(X_n) = M/N + M/N + \cdots + M/N = n(M/N) = np \]

where p denotes the success probability on any particular draw (trial). That is, just as in the case of
a binomial rv, the expected value of a hypergeometric rv is the success probability on any trial
multiplied by the number of trials. Notice that we were able to apply Equation (5.4), even though the
$X_i$'s are not independent.

Since each $X_i$ is Bernoulli, it follows that $V(X_i) = p(1 - p)$ or (M/N)(1 - M/N). However, the
variance of X here is not the same as the binomial variance, precisely because the successive draws are
not independent. Consider $p(x_1, x_2)$, the joint pmf of $X_1$ and $X_2$:

\[
p(1, 1) = \left(\frac{M}{N}\right)\left(\frac{M - 1}{N - 1}\right), \qquad
p(0, 0) = \left(\frac{N - M}{N}\right)\left(\frac{N - M - 1}{N - 1}\right), \qquad
p(1, 0) = p(0, 1) = \left(\frac{M}{N}\right)\left(\frac{N - M}{N - 1}\right)
\]

This is also the joint pmf of any pair $X_i$, $X_j$. A slightly tedious calculation then results in

\[ \mathrm{Cov}(X_i, X_j) = -\frac{p(1 - p)}{N - 1} \qquad \text{for } i \ne j \]

Applying Equation (5.5) yields

\[
\begin{aligned}
V(X) = V(X_1 + \cdots + X_n) &= \sum_{i=1}^{n} V(X_i) + 2\mathop{\sum\sum}_{i<j} \mathrm{Cov}(X_i, X_j) \\
  &= nV(X_1) + 2\binom{n}{2}\mathrm{Cov}(X_1, X_2) \\
  &= np(1 - p) + n(n - 1)\left(-\frac{p(1 - p)}{N - 1}\right)
   = np(1 - p)\left(1 - \frac{n - 1}{N - 1}\right)
\end{aligned}
\]

This is quite close to the binomial variance provided that n is much smaller than N so that the last
term in parentheses is close to 1. ■
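The mean and variance formulas derived in Example 5.19 can be checked against a brute-force computation from the hypergeometric pmf. This is a sketch under assumed values (n = 5, M = 8, N = 20 are hypothetical, chosen only for illustration), using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, M, N):
    """P(X = x): x successes in n draws without replacement (M successes among N)."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


def moments_from_pmf(n, M, N):
    """Mean and variance of X computed directly from the pmf."""
    support = range(max(0, n - (N - M)), min(n, M) + 1)
    mean = sum(x * hypergeom_pmf(x, n, M, N) for x in support)
    second_moment = sum(x * x * hypergeom_pmf(x, n, M, N) for x in support)
    return mean, second_moment - mean ** 2


def moments_from_formula(n, M, N):
    """Mean np and variance np(1 - p)(1 - (n - 1)/(N - 1)), with p = M/N."""
    p = Fraction(M, N)
    return n * p, n * p * (1 - p) * (1 - Fraction(n - 1, N - 1))


n, M, N = 5, 8, 20  # hypothetical sample size and population makeup
direct = moments_from_pmf(n, M, N)
formula = moments_from_formula(n, M, N)
print(direct, formula)
```

Because `Fraction` is exact, the two pairs should match exactly rather than merely to rounding error.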


The following corollary expresses the n = 2 case of the main theorem for ease of use, including the
important special cases of the sum and the difference of two random variables.


COROLLARY For any two rvs $X_1$ and $X_2$, and any constants $a_1$, $a_2$, b,

\[ E(a_1X_1 + a_2X_2 + b) = a_1E(X_1) + a_2E(X_2) + b \]

and

\[ V(a_1X_1 + a_2X_2 + b) = a_1^2V(X_1) + a_2^2V(X_2) + 2a_1a_2\mathrm{Cov}(X_1, X_2) \]

In particular, $E(X_1 + X_2) = E(X_1) + E(X_2)$ and, if $X_1$ and $X_2$ are independent,
$V(X_1 + X_2) = V(X_1) + V(X_2)$.¹ Also, $E(X_1 - X_2) = E(X_1) - E(X_2)$ and, if $X_1$
and $X_2$ are independent, $V(X_1 - X_2) = V(X_1) + V(X_2)$.


The expected value of a difference is the difference of the two expected values, but the variance of a
difference between two independent variables is the sum, not the difference, of the two variances.
There is just as much variability in $X_1 - X_2$ as in $X_1 + X_2$: Writing
$X_1 - X_2 = X_1 + (-1)X_2$, the term $(-1)X_2$ has the same amount of variability as $X_2$ itself.


Example 5.20 An automobile manufacturer equips a particular model with either a six-cylinder
engine or a four-cylinder engine. Let $X_1$ and $X_2$ be fuel efficiencies (mpg) for independently and
randomly selected six-cylinder and four-cylinder cars, respectively. With $\mu_1 = 22$, $\mu_2 = 26$,
$\sigma_1 = 1.2$, and $\sigma_2 = 1.5$,

\[
\begin{aligned}
E(X_1 - X_2) &= \mu_1 - \mu_2 = 22 - 26 = -4 \text{ mpg} \\
V(X_1 - X_2) &= \sigma_1^2 + \sigma_2^2 = 1.2^2 + 1.5^2 = 3.69 \\
\sigma_{X_1 - X_2} &= \sqrt{3.69} = 1.92 \text{ mpg}
\end{aligned}
\]

If we re-label so that $X_1$ refers to the four-cylinder car, then $E(X_1 - X_2) = 26 - 22 = 4$ mpg,
but the standard deviation of the difference is still 1.92 mpg. ■


¹This property of independent rvs can also be written as
$\sigma_1^2 + \sigma_2^2 = \sigma_{X_1 + X_2}^2$. In part because the formula has the format
$a^2 + b^2 = c^2$, statisticians sometimes call this property the Pythagorean Theorem.


The PDF of a Sum of Continuous RVs 

Generally speaking, knowing the mean and standard deviation of a random variable W is not enough
to specify its probability distribution and thus be able to compute probabilities such as P(W > 10) or
$P(W \le -2)$. In the case of independent rvs, a general method exists for determining the pdf of the
sum $X_1 + \cdots + X_n$ from their marginal pdfs. We present first the result for two random
variables.


THEOREM Suppose X and Y are independent, continuous rvs with marginal pdfs $f_X(x)$ and $f_Y(y)$,
respectively. Then the pdf of the rv W = X + Y is given by

\[ f_W(w) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\,dx \]

[In mathematics, this integral operation is known as the convolution of $f_X(x)$ and $f_Y(y)$ and is
sometimes denoted $f_W = f_X * f_Y$.] The limits of integration are determined by which x values make
both $f_X(x) > 0$ and $f_Y(w - x) > 0$.


Proof Since X and Y are independent, their joint pdf is given by $f_X(x) \cdot f_Y(y)$. The cdf of W
is then

\[ F_W(w) = P(W \le w) = P(X + Y \le w) \]

To calculate $P(X + Y \le w)$, we must integrate over the set of numbers
$\{(x, y): x + y \le w\}$, which is the shaded region indicated in Figure 5.5.

Figure 5.5 Region of integration for $P(X + Y \le w)$

The resulting limits of integration are $-\infty < x < \infty$ and $-\infty < y \le w - x$, and so

\[
\begin{aligned}
F_W(w) = P(X + Y \le w)
  &= \int_{-\infty}^{\infty}\int_{-\infty}^{w - x} f_X(x)\, f_Y(y)\,dy\,dx
   = \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{w - x} f_Y(y)\,dy\right]dx \\
  &= \int_{-\infty}^{\infty} f_X(x)\, F_Y(w - x)\,dx
\end{aligned}
\]

The pdf of W is the derivative of this expression with respect to w; taking the derivative underneath
the integral sign yields the desired result. ■


By a similar argument, the pdf of W = X + Y can be determined even when X and Y are not
independent. Assuming X and Y have joint pdf f(x, y),
$f_W(w) = \int_{-\infty}^{\infty} f(x, w - x)\,dx$.


Example 5.21 In a standby system, a component is used until it wears out and is then immediately
replaced by another, not necessarily identical, component. (The second component is said to be "in
standby mode," i.e., waiting to be used.) The overall lifetime of a standby system is just the sum of
the lifetimes of its individual components. Let X and Y denote the lifetimes of the two components of a
standby system, and suppose X and Y are independent exponentially distributed random variables
with mean lifetimes 3 weeks and 4 weeks, respectively. Let W = X + Y, the system lifetime.

Using Equation (5.4), the expected lifetime of the standby system is E(W) = E(X) + E(Y) =
3 + 4 = 7 weeks. Since X and Y are exponential, the variance of each one is the square of its mean (9
and 16, respectively); since they are also independent,

\[ V(W) = V(X) + V(Y) = 3^2 + 4^2 = 25 \]

It follows that $\sigma_W = 5$ weeks. Since $\mu_W \ne \sigma_W$, W cannot itself be exponentially
distributed, but we can use the previous theorem to find its pdf.

The marginal pdfs of X and Y are $f_X(x) = (1/3)e^{-x/3}$ for x > 0 and $f_Y(y) = (1/4)e^{-y/4}$ for
y > 0. Substituting y = w - x, the inequalities x > 0 and w - x > 0 imply 0 < x < w, which specify the
limits of integration of the convolution integral:

\[
\begin{aligned}
f_W(w) &= \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\,dx
        = \int_0^w (1/3)e^{-x/3}\,(1/4)e^{-(w - x)/4}\,dx
        = \frac{1}{12}e^{-w/4}\int_0^w e^{-x/12}\,dx \\
       &= e^{-w/4}\left(1 - e^{-w/12}\right) \qquad w > 0
\end{aligned}
\]

A graph of this pdf appears in Figure 5.6. As a check, the mean and variance of W can be verified
directly from its pdf.

Figure 5.6 The pdf of W = X + Y for Example 5.21

The probability that the system lasts more than its expected lifetime of 7 weeks is given by

\[
P(W > 7) = \int_7^{\infty} f_W(w)\,dw
         = \int_7^{\infty} e^{-w/4}\left(1 - e^{-w/12}\right)dw = .4042 \qquad \blacksquare
\]
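The closed-form pdf in Example 5.21 can be sanity-checked by evaluating the convolution integral numerically. The sketch below uses a simple midpoint rule (an illustrative choice, not the book's method); the tail formula $4e^{-w/4} - 3e^{-w/3}$ comes from integrating the closed-form pdf from w to infinity.

```python
from math import exp


def f_x(x):
    """Exponential pdf with mean 3 weeks."""
    return exp(-x / 3) / 3 if x >= 0 else 0.0


def f_y(y):
    """Exponential pdf with mean 4 weeks."""
    return exp(-y / 4) / 4 if y >= 0 else 0.0


def convolution_at(w, steps=2000):
    """Midpoint-rule approximation of the integral of f_x(x) * f_y(w - x) over 0 <= x <= w."""
    if w <= 0:
        return 0.0
    h = w / steps
    return h * sum(f_x((k + 0.5) * h) * f_y(w - (k + 0.5) * h) for k in range(steps))


def f_w(w):
    """Closed-form pdf of W = X + Y from Example 5.21."""
    return exp(-w / 4) * (1 - exp(-w / 12)) if w > 0 else 0.0


def tail_probability(w):
    """P(W > w) for w >= 0, from integrating f_w: 4e^(-w/4) - 3e^(-w/3)."""
    return 4 * exp(-w / 4) - 3 * exp(-w / 3)


print(convolution_at(7.0), f_w(7.0))
print(round(tail_probability(7.0), 4))
```

The numerical convolution and the closed form should agree to many decimal places, and the tail probability at 7 weeks reproduces the value .4042.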


As a generalization of the previous proposition, the pdf of the sum $W = X_1 + \cdots + X_n$ of n
independent, continuous rvs can be determined by successive convolution:
$f_W = f_1 * \cdots * f_n$. In most situations, it isn't practical to evaluate such a complicated
object. Thankfully, as we'll see next, such tedious computations can sometimes be avoided with the use
of moment generating functions.


Moment Generating Functions for Linear Combinations 

A proposition in Section 5.2 stated that the expected value of a product of functions of independent
random variables is the product of the individual expected values. We now use this to formulate the
moment generating function of a linear combination of independent random variables.


PROPOSITION Let $X_1, X_2, \ldots, X_n$ be independent rvs with moment generating functions
$M_{X_1}(t), M_{X_2}(t), \ldots, M_{X_n}(t)$, respectively. Then the moment generating function of
$Y = a_1X_1 + a_2X_2 + \cdots + a_nX_n + b$ is

\[ M_Y(t) = e^{bt} \cdot M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt) \]

In the special case that $a_1 = a_2 = \cdots = a_n = 1$ and b = 0, so $Y = X_1 + \cdots + X_n$,

\[ M_Y(t) = M_{X_1}(t) \cdot M_{X_2}(t) \cdots M_{X_n}(t) \]

That is, the mgf of a sum of independent rvs is the product of the individual mgfs.


Proof First, we write the moment generating function of Y as the expected value of a product.

\[
\begin{aligned}
M_Y(t) = E[e^{tY}] &= E[e^{t(a_1X_1 + a_2X_2 + \cdots + a_nX_n + b)}] \\
  &= E[e^{ta_1X_1 + ta_2X_2 + \cdots + ta_nX_n + tb}]
   = e^{bt}E[e^{a_1tX_1} \cdot e^{a_2tX_2} \cdots e^{a_ntX_n}]
\end{aligned}
\]

The last expression inside brackets is the product of functions of $X_1, X_2, \ldots, X_n$. Since the
$X_i$'s are independent, the expected value can be distributed across this product:

\[
\begin{aligned}
e^{bt}E[e^{a_1tX_1} \cdot e^{a_2tX_2} \cdots e^{a_ntX_n}]
  &= e^{bt}E[e^{a_1tX_1}] \cdot E[e^{a_2tX_2}] \cdots E[e^{a_ntX_n}] \\
  &= e^{bt}M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt) \qquad \blacksquare
\end{aligned}
\]


Now suppose we wish to determine the pdf of some linear combination of independent rvs. Provided 
we have their mgfs, the previous proposition makes it easy to determine the mgf of the linear 
combination. Then, if we can recognize this mgf as belonging to some known distributional family 
(binomial, exponential, etc.), the uniqueness property of mgfs guarantees our linear combination has 
that particular distribution. The next several propositions illustrate this technique. 


PROPOSITION If $X_1, X_2, \ldots, X_n$ are independent, normally distributed rvs (with possibly
different means and/or sds), then any linear combination of the $X_i$'s also
has a normal distribution. In particular, the sum of independent normally
distributed rvs itself has a normal distribution, and the difference $X_1 - X_2$
between two independent, normally distributed variables is itself normally
distributed.


Proof Let $Y = a_1X_1 + a_2X_2 + \cdots + a_nX_n + b$, where $X_i$ is normally distributed with mean
$\mu_i$ and standard deviation $\sigma_i$, and the $X_i$ are independent. From Section 4.3,
$M_{X_i}(t) = e^{\mu_i t + \sigma_i^2 t^2/2}$. Therefore,

\[
\begin{aligned}
M_Y(t) &= e^{bt}M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt) \\
  &= e^{bt}e^{\mu_1a_1t + \sigma_1^2a_1^2t^2/2}\,e^{\mu_2a_2t + \sigma_2^2a_2^2t^2/2} \cdots
     e^{\mu_na_nt + \sigma_n^2a_n^2t^2/2} \\
  &= e^{(\mu_1a_1 + \mu_2a_2 + \cdots + \mu_na_n + b)t
        + (\sigma_1^2a_1^2 + \sigma_2^2a_2^2 + \cdots + \sigma_n^2a_n^2)t^2/2} \\
  &= e^{\mu t + \sigma^2t^2/2}
\end{aligned}
\]

where $\mu = a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n + b$ and
$\sigma^2 = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2$. We recognize this
function as the mgf of a normal random variable, and it follows that Y is normally distributed by the
uniqueness property of mgfs. Notice that the mean and variance are in agreement with the first
proposition of this section. ■


Example 5.22 (Example 5.18 continued) The total revenue from the sale of the three grades of
gasoline on a particular day was \(Y = 3.5X_1 + 3.65X_2 + 3.8X_3\), and we calculated \(\mu_Y\) = $6465 and
(assuming independence) \(\sigma_Y\) = $493.83. If the \(X_i\)'s are (approximately) normally distributed, the
probability that revenue exceeds $5000 is

\[ P(Y > 5000) = P\left(Z > \frac{5000 - 6465}{493.83}\right) = P(Z > -2.967) = 1 - \Phi(-2.967) = .9985 \]
■
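The probability in Example 5.22 can be checked numerically. The following is a minimal sketch using only Python's standard library; the variable names (`mu_Y`, `sigma_Y`, `prob_exceeds`) are ours and are not part of the text.

```python
from statistics import NormalDist

mu_Y = 6465.0      # mean revenue, from Example 5.18
sigma_Y = 493.83   # sd of revenue, assuming the X_i's are independent

# A linear combination of independent normal rvs is itself normal,
# so Y ~ N(mu_Y, sigma_Y) under the example's assumptions.
Y = NormalDist(mu=mu_Y, sigma=sigma_Y)
prob_exceeds = 1 - Y.cdf(5000)

print(round(prob_exceeds, 4))
```

Working directly with the distribution of Y avoids the intermediate rounding of the z value to -2.967.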


This same method may be applied to discrete rvs, as the next proposition indicates. 


PROPOSITION Suppose \(X_1, \ldots, X_n\) are independent Poisson random variables, where \(X_i\) has mean
\(\mu_i\). Then \(Y = X_1 + \cdots + X_n\) also has a Poisson distribution, with mean
\(\mu_1 + \cdots + \mu_n\).


Proof From Section 3.6, the mgf of a Poisson rv with mean \(\mu\) is \(e^{\mu(e^t - 1)}\). Since Y is the sum of the
\(X_i\)'s, and the \(X_i\)'s are independent,

\[ M_Y(t) = M_{X_1}(t) \cdots M_{X_n}(t) = e^{\mu_1(e^t - 1)} \cdots e^{\mu_n(e^t - 1)} = e^{(\mu_1 + \cdots + \mu_n)(e^t - 1)} \]

This is the mgf of a Poisson rv with mean \(\mu_1 + \cdots + \mu_n\). Therefore, by the uniqueness property of
mgfs, Y has a Poisson distribution with mean \(\mu_1 + \cdots + \mu_n\). ■


Example 5.23 During the open enrollment period at a large university, the number of freshmen
registering for classes through the online registration system in one hour follows a Poisson distribution
with mean 80 students; denote this rv by \(X_1\). Define \(X_2\), \(X_3\), and \(X_4\) similarly for sophomores,
juniors, and seniors, and suppose the corresponding means are 125, 118, and 140, respectively.
Assume these four counts are independent. The rv \(Y = X_1 + X_2 + X_3 + X_4\) represents the total number
of undergraduate students registering in one hour; by the preceding proposition, Y is also a Poisson rv,
but with mean 80 + 125 + 118 + 140 = 463 students and standard deviation \(\sqrt{463} = 21.5\) students.
The probability that more than 500 students register during one hour, exceeding the registration
system's capacity, is then \(P(Y > 500) = 1 - P(Y \le 500) = .042\) (using software). ■
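The phrase "using software" in Example 5.23 can be made concrete. The sketch below is one way to do it with Python's standard library alone; the helper `poisson_cdf` is a hypothetical name of ours, and it sums the Poisson pmf on the log scale to avoid overflow in 463^j and j!.

```python
from math import exp, lgamma, log

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); each pmf term is computed as
    exp(-mu + j*log(mu) - log(j!)) to stay within floating-point range."""
    return sum(exp(-mu + j * log(mu) - lgamma(j + 1)) for j in range(k + 1))

# Means for freshmen, sophomores, juniors, and seniors add, by the proposition
mu_Y = 80 + 125 + 118 + 140
prob_over_capacity = 1 - poisson_cdf(500, mu_Y)

print(mu_Y, round(prob_over_capacity, 3))
```

Note that the complement of "more than 500" is "at most 500", so the cdf is evaluated at 500 itself.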


Because of the properties stated in the preceding two propositions, both the normal and Poisson 
models are sometimes called additive distributions, meaning that the sum of independent rvs from 
that family (normal or Poisson) will also belong to that family. The next proposition shows that not all 
of the major probability distributions are additive; its proof is left as an exercise (Exercise 65). 


PROPOSITION Suppose \(X_1, \ldots, X_n\) are independent exponential random variables with common
parameter \(\lambda\). Then \(Y = X_1 + \cdots + X_n\) has a gamma distribution, with parameters
\(\alpha = n\) and \(\beta = 1/\lambda\).


Therefore, the exponential distribution is not additive, although it can be shown that its "parent," the
gamma distribution, is additive under certain conditions (see Exercise 64). Notice that this proposition
requires the \(X_i\)'s to have the same "rate" parameter \(\lambda\); i.e., the \(X_i\)'s must be independent and identically
distributed for their sum to have a gamma distribution. As we saw in Example 5.21, the sum of
two independent exponential rvs with different parameter values follows neither an exponential nor a
gamma distribution.
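A simulation can make the exponential-to-gamma proposition plausible before it is proved in Exercise 65. The sketch below is illustrative only: the values n = 5, λ = 2, and the cutoff 3.0 are arbitrary choices of ours, and `erlang_cdf` uses the closed-form cdf of a gamma rv with integer shape α = n and scale β = 1/λ.

```python
import random
from math import exp, factorial

def erlang_cdf(y, n, lam):
    """cdf at y of a gamma rv with alpha = n (a positive integer) and beta = 1/lam."""
    return 1 - sum(exp(-lam * y) * (lam * y) ** k / factorial(k) for k in range(n))

rng = random.Random(0)                 # fixed seed so the run is repeatable
n, lam, y0, reps = 5, 2.0, 3.0, 20000  # illustrative values, not from the text

# Simulate Y = X_1 + ... + X_n with iid exponential(lam) terms and
# record how often Y falls at or below y0.
hits = sum(
    sum(rng.expovariate(lam) for _ in range(n)) <= y0
    for _ in range(reps)
)
simulated = hits / reps
exact = erlang_cdf(y0, n, lam)

print(round(simulated, 3), round(exact, 3))
```

Agreement between the two printed numbers is evidence for, not a proof of, the proposition; the mgf argument supplies the proof.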

Exercises: Section 5.3 (43-67)


43. A shipping company handles containers in
three different sizes: (1) 27 ft³ (3 × 3 × 3),
(2) 125 ft³, and (3) 512 ft³. Let \(X_i\) (i = 1, 2, 3)
denote the number of type i containers
shipped during a given week. With \(\mu_i = E(X_i)\)
and \(\sigma_i^2 = V(X_i)\), suppose that the mean
values and standard deviations are as
follows:

\(\mu_1 = 200\)   \(\mu_2 = 250\)   \(\mu_3 = 100\)
\(\sigma_1 = 10\)   \(\sigma_2 = 12\)   \(\sigma_3 = 8\)

a. Assuming that \(X_1, X_2, X_3\) are independent,
calculate the expected value and
standard deviation of the total volume
shipped. [Hint: Volume = \(27X_1 + 125X_2 + 512X_3\).]

b. Would your calculations necessarily be
correct if the \(X_i\)'s were not independent?
Explain.

c. Suppose the \(X_i\)'s are independent with
each one (approximately) normal. What is
the (approximate) probability that the total
volume shipped is at most 100,000 ft³?


44. Let \(X_1\), \(X_2\), and \(X_3\) represent the times
necessary to perform three successive
repair tasks at a service facility. Suppose
they are independent, normal rvs with
expected values \(\mu_1\), \(\mu_2\), and \(\mu_3\) and variances
\(\sigma_1^2\), \(\sigma_2^2\), and \(\sigma_3^2\), respectively.

a. If \(\mu_1 = \mu_2 = \mu_3 = 60\) and \(\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 15\),
calculate \(P(X_1 + X_2 + X_3 < 200)\).

b. Using the \(\mu_i\)'s and \(\sigma_i\)'s from (a), what is
\(P(150 < X_1 + X_2 + X_3 < 200)\)?

c. Using the \(\mu_i\)'s and \(\sigma_i\)'s given in part (a),
calculate \(P(55 < \bar{X})\) and \(P(58 < \bar{X} < 62)\).
[Hint: \(\bar{X} = (X_1 + X_2 + X_3)/3\).]

d. Using the values from part (a), calculate
\(P(-10 < X_1 - .5X_2 - .5X_3 < 5)\).

e. If \(\mu_1 = 40\), \(\mu_2 = 50\), \(\mu_3 = 60\), \(\sigma_1^2 = 10\),
\(\sigma_2^2 = 12\), and \(\sigma_3^2 = 14\), calculate
\(P(X_1 + X_2 + X_3 < 160)\) and also
\(P(X_1 + X_2 > 2X_3)\).


45. Five automobiles of the same type are to be
driven on a 300-mile trip. The first two
have six-cylinder engines, and the other
three have four-cylinder engines. Let \(X_1\),
\(X_2\), \(X_3\), \(X_4\), and \(X_5\) be the observed fuel
efficiencies (mpg) for the five cars. Suppose
these variables are independent and
normally distributed with \(\mu_1 = \mu_2 = 30\),
\(\mu_3 = \mu_4 = \mu_5 = 35\), and \(\sigma = 2.5\) for the two
larger engines and 3.6 for the three smaller
engines. Define a rv Y by

\[ Y = \frac{X_1 + X_2}{2} - \frac{X_3 + X_4 + X_5}{3} \]

so that Y is a measure of the difference in
efficiency between the six-cylinder and
four-cylinder engines. Compute P(Y > 0)
and P(-3 < Y < 3). [Hint: Y is a linear
combination; what are the \(a_i\)'s?]


46. Exercise 28 introduced random variables
X and Y, the number of cars and
buses, respectively, carried by a ferry on 
a single trip. These rvs are, in fact, 
independent. 


a. Compute the expected value, variance, 
and standard deviation of the total 
number of vehicles on a single trip. 

b. If each car is charged $3 and each 
bus $10, compute the expected value, 
variance, and standard deviation of the 
revenue resulting from a single trip. 


47. A concert has three pieces of music to be
played before intermission. The time taken 
to play each piece has a normal distribution. 
Assume that the three times are indepen- 
dent of each other. The mean times are 15, 
30, and 20 min, respectively, and the 
standard deviations are 1, 2, and 1.5 min, 
respectively. What is the probability that 
this part of the concert takes at most one 
hour? Are there reasons to question the
independence assumption? Explain.


48. Refer to Exercise 3.

a. Calculate the covariance between \(X_1\) =
the number of customers in the express
checkout and \(X_2\) = the number of customers
in the superexpress checkout.

b. Calculate \(V(X_1 + X_2)\). How does this
compare to \(V(X_1) + V(X_2)\)?


49. Suppose your waiting time for a bus in the
morning is uniformly distributed on [0, 8],
whereas waiting time in the evening is
uniformly distributed on [0, 10] independent
of morning waiting time.


a. If you take the bus each morning and
evening for a week, what is your total
expected waiting time? [Hint: Define rvs
\(X_1, \ldots, X_{10}\) and use a rule of expected
value.]

b. What is the variance of your total wait- 
ing time? 

c. What are the expected value and variance 
of the difference between morning and 
evening waiting times on a given day? 

d. What are the expected value and vari- 
ance of the difference between total 
morning waiting time and total evening 
waiting time for a particular week? 


50. An insurance office buys paper by the ream
(500 sheets) for use in the copier, fax, and
printer. Each ream lasts an average of
4 days, with standard deviation 1 day. The
distribution is normal, independent of previous
reams.


a. Find the probability that the next ream 
outlasts the present one by more than 
two days. 

b. How many reams must be purchased if 
they are to last at least 60 days with 
probability at least 80%? 


51. If two loads are applied to a cantilever
beam as shown in the accompanying
drawing, the bending moment at 0 due to
the loads is \(a_1X_1 + a_2X_2\).

[Figure: cantilever beam with loads \(X_1\) and \(X_2\) at distances \(a_1\) and \(a_2\) from 0]

a. Suppose that \(X_1\) and \(X_2\) are independent
rvs with means 2 and 4 kips, respectively,
and standard deviations .5 and 1.0
kip, respectively. If \(a_1 = 5\) ft and
\(a_2 = 10\) ft, what is the expected bending
moment and what is the standard deviation
of the bending moment?

b. If \(X_1\) and \(X_2\) are normally distributed,
what is the probability that the bending
moment will exceed 75 kip-ft?

c. Suppose the positions of the two loads
are random variables. Denoting them by
\(A_1\) and \(A_2\), assume that these variables
have means of 5 and 10 ft, respectively,
that each has a standard deviation of .5,
and that all \(A_i\)'s and \(X_i\)'s are independent
of each other. What is the expected
moment now?

d. For the situation of part (c), what is the
variance of the bending moment?

e. If the situation is as described in part
(a) except that \(\mathrm{Corr}(X_1, X_2) = .5\) (so that
the two loads are not independent),
what is the variance of the bending
moment?


52. One piece of PVC pipe is to be inserted
inside another piece. The length of the first 
piece is normally distributed with mean 
value 20 in. and standard deviation .5 in. 
The length of the second piece is a normal 
rv with mean and standard deviation 15 in. 
and .4 in., respectively. The amount of 
overlap is normally distributed with mean 
value 1 in. and standard deviation .1 in. 
Assuming that the lengths and amount of 
overlap are independent of each other, 
what is the probability that the total 
length after insertion is between 34.5 in. 
and 35 in.?


53. Two airplanes are flying in the same
direction in adjacent parallel corridors. At
time t = 0, the first airplane is 10 km ahead
of the second one. Suppose the speed of the 
first plane (km/h) is normally distributed 
with mean 520 and standard deviation 10 
and the second plane’s speed, independent 
of the first, is also normally distributed with 
mean and standard deviation 500 and 10, 
respectively. 


a. What is the probability that after 2 h of 
flying, the second plane has not caught 
up to the first plane? 

b. Determine the probability that the planes 
are separated by at most 10 km after 
2h. 


54. Three different roads feed into a particular
freeway entrance. Suppose that during a 
fixed time period, the number of cars 
coming from each road onto the freeway 
is a random variable, with expected value 
and standard deviation as given in the 
table. 


Road 1 Road 2 Road 3 
Expected value 800 1000 600 
Standard deviation 16 25 18 


a. What is the expected total number of 
cars entering the freeway at this point 
during the period? [Hint: Let X; = the 
number from road i.] 

b. What is the standard deviation of the 
total number of entering cars? Have you 
made any assumptions about the rela- 
tionship between the numbers of cars on 
the different roads? 

c. With \(X_i\) denoting the number of cars entering
from road i during the period, suppose
\(\mathrm{Cov}(X_1, X_2) = 80\), \(\mathrm{Cov}(X_1, X_3) = 90\), and
\(\mathrm{Cov}(X_2, X_3) = 100\) (so that the three
streams of traffic are not independent).
Compute the expected total number of
entering cars and the standard deviation of
the total.


55. Consider independent rvs \(X_1, \ldots, X_n\) from a
continuous distribution having median 0, so
that the probability of any one observation
being positive is .5. Now disregard the signs
of the observations, rank them from smallest
to largest in absolute value, and then let
W = the sum of the ranks of the observations
having positive signs. For example, if
the observations are -.3, +.7, +2.1, and -2.5,
then the ranks of positive observations are 2
and 3, so W = 5. In the statistics literature,
W is called Wilcoxon's signed-rank statistic.
W can be represented as follows:

\[ W = 1 \cdot Y_1 + 2 \cdot Y_2 + 3 \cdot Y_3 + \cdots + n \cdot Y_n = \sum_{i=1}^{n} i \cdot Y_i \]

where the \(Y_i\)'s are independent Bernoulli
rvs, each with p = .5 (\(Y_i = 1\) corresponds to
the observation with rank i being positive).
Compute the following:

a. \(E(Y_i)\) and then E(W) using the equation
for W [Hint: The first n positive integers
sum to n(n + 1)/2.]

b. \(V(Y_i)\) and then V(W) [Hint: The sum of
the squares of the first n positive integers
is n(n + 1)(2n + 1)/6.]


56. In Exercise 51, the weight of the beam itself
contributes to the bending moment.
Assume that the beam is of uniform thickness
and density so that the resulting load is
uniformly distributed on the beam. If the
weight of the beam is random, the resulting
load from the weight is also random; denote
this load by W (kip-ft).


a. If the beam is 12 ft long, W has mean 1.5
and standard deviation .25, and the fixed
loads are as described in part (a) of
Exercise 51, what are the expected value
and variance of the bending moment?
[Hint: If the load due to the beam were
w kip-ft, the contribution to the bending
moment would be \(w \int_0^{12} x\,dx\).]

b. If all three variables (\(X_1\), \(X_2\), and W) are
normally distributed, what is the probability
that the bending moment will be at
most 200 kip-ft?


57. A professor has three errands to take care of
in the Administration Building. Let \(X_i\) =
the time that it takes for the ith errand
(i = 1, 2, 3), and let \(X_4\) = the total time in
minutes that she spends walking to and
from the building and between each errand.
Suppose the \(X_i\)'s are independent, normally
distributed, with the following means and
standard deviations: \(\mu_1 = 15\), \(\sigma_1 = 4\),
\(\mu_2 = 5\), \(\sigma_2 = 1\), \(\mu_3 = 8\), \(\sigma_3 = 2\), \(\mu_4 = 12\),
\(\sigma_4 = 3\). She plans to leave her office at
precisely 10:00 a.m. and wishes to post a
note on her door that reads, "I will return by
t a.m." What time t should she write down
if she wants the probability of her arriving
after t to be .01?


58. In an area having sandy soil, 50 small trees
of a certain type were planted, and another
50 trees were planted in an area having clay
soil. Let X = the number of trees planted in
sandy soil that survive 1 year and Y = the
number of trees planted in clay soil that
survive 1 year. If the probability that a tree
planted in sandy soil will survive 1 year is
.7 and the probability of 1-year survival in
clay soil is .6, compute an approximation to
\(P(-5 \le X - Y \le 5)\). [Hint: Use a normal
approximation from Section 3.3. Do not
bother with the continuity correction.]


59. Let X and Y be independent rvs, with
X ~ N(0, 1) and Y ~ N(0, 1).


a. Use convolution to show that X + Y is 
also normal, and identify its mean and 
standard deviation. 

b. Use the additive property of the normal 
distribution presented in this section to 
verify your answer to part (a). 


60. Karen throws two darts at a board with radius
10 in.; let X and Y denote the distances of the
two darts from the center of the board. Under
the system Karen uses, the score she
obtains depends on W = X + Y, the sum of
these two distances. Assume X and Y are
independent.


a. If X and Y are both uniformly distributed
on the interval [0, 10], use convolution
to determine the pdf of W = X + Y. Be
very careful with your limits of
integration!

b. Based on the pdf in part (a), calculate
P(X + Y < 5).

c. If Karen's darts are equally likely to land
anywhere on the board, it can be shown
that the pdfs of X and Y are \(f_X(x) = x/50\)
for 0 < x < 10 and \(f_Y(y) = y/50\) for
0 < y < 10. Use convolution to
determine the pdf of W = X + Y. Then,
calculate P(X + Y < 5).


61. Siblings Matt and Liz both enjoy playing
roulette. One day, Matt brought $10 to the
local casino and Liz brought $15. They sat
at different tables, and each made $1 wagers on
red on consecutive spins (10 spins for Matt,
15 for Liz). Let X = the number of times
Matt won and Y = the number of times Liz
won.


a. What is a reasonable probability model
for X? [Hint: Successive spins of a
roulette wheel are independent, and
P(land on red) = 18/38.]

b. What is a reasonable probability model
for Y?

c. What is a reasonable probability model
for X + Y, the total number of times
Matt and Liz win that day? Explain.
[Hint: Since the siblings sat at different
tables, their gambling results are
independent.]

d. Use moment generating functions, along
with your answers to (a) and (b), to show
that your answer to part (c) is correct.

e. Generalize part (d): If \(X_1, \ldots, X_k\)
are independent binomial rvs, with
\(X_i \sim \mathrm{Bin}(n_i, p)\), show that their sum is
also binomially distributed.

f. Does the result of part (e) hold if the
parameter p has a different value for each
\(X_i\) (e.g., if Matt bets on red but Liz bets
on the number 27)?


62. The children attending Milena's birthday
party are enjoying taking swings at a piñata.
Let X = the number of swings it takes
Milena to hit the piñata once (since she's
the birthday girl, she goes first), and let
Y = the number of swings it takes her
brother Lucas to hit the piñata once (he
goes second). Assume the results of successive
swings are independent (the children
don't improve, since they're
blindfolded), and that each child has a .2
probability of hitting the piñata on any
attempt.


a. What is a reasonable probability model 
for X? 

b. What is a reasonable probability model 
for Y? 

c. What is a reasonable probability model 
for X + Y, the total number of swings 
taken by Milena and Lucas? Explain. 
(Assume Milena’s and Lucas’ results are 
independent.) 

d. Use moment generating functions, along 
with your answers to (a) and (b), to 
show that your answer to part (c) is 
correct. 

e. Generalize part (d): If \(X_1, \ldots, X_n\) are
independent geometric rvs with common
parameter p, show that their sum has a
negative binomial distribution.

f. Does the result of part (e) hold if the
probability parameter p is different for
each \(X_i\) (e.g., if Milena has probability .4
on each attempt while Lucas' success
probability is only .1)?


63. Let \(X_1, \ldots, X_n\) be independent rvs, with \(X_i\)
having a negative binomial distribution
with parameters \(r_i\) and p (i = 1, ..., n). Use
moment generating functions to show that
\(X_1 + \cdots + X_n\) has a negative binomial
distribution, and identify the parameters of
this distribution. Explain why this answer
makes sense, based on the negative binomial
model. [Note: Each \(X_i\) may have a
different parameter \(r_i\), but all have the same
p parameter.]


64. Let X and Y be independent gamma random
variables, both with the same scale parameter
\(\beta\). The value of the other parameter is \(\alpha_1\)
for X and \(\alpha_2\) for Y. Use moment generating
functions to show that X + Y is also gamma
distributed, with shape parameter \(\alpha_1 + \alpha_2\)
and scale parameter \(\beta\). Is X + Y gamma
distributed if the scale parameters are different?
Explain.


65. Let X and Y be independent exponential
random variables with common
parameter \(\lambda\).


a. Use convolution to show that X + Y has
a gamma distribution, and identify the
parameters of that gamma distribution.

b. Use the previous exercise to establish
the same result.

c. Generalize part (b): If \(X_1, \ldots, X_n\) are
independent exponential rvs with common
parameter \(\lambda\), what is the distribution
of their sum?


66. For men, pulse rates (in beats per minute)
are normally distributed with mean 70 and
standard deviation 10. Women's pulse rates
are normally distributed with mean 77 and
standard deviation 12. Let \(\bar{X}\) = the sample
average pulse rate for a random sample of
40 men and let \(\bar{Y}\) = the sample average
pulse rate for a random sample of 36
women.


a. What is the distribution of \(\bar{X}\)? Of \(\bar{Y}\)?
[Hint: \(\bar{X} = \frac{1}{40}X_1 + \cdots + \frac{1}{40}X_{40}\), and similarly
for \(\bar{Y}\).]

b. What is the distribution of \(\bar{X} - \bar{Y}\)? Justify
your answer.

c. Calculate \(P(-2 \le \bar{X} - \bar{Y} \le 1)\).

d. Calculate \(P(\bar{X} - \bar{Y} \le -15)\). If you
actually observed \(\bar{X} - \bar{Y} \le -15\), would
you doubt that \(\mu_1 - \mu_2 = -7\)? Explain.


67. The Laplace (or double exponential)
distribution has pdf \(f(x) = \frac{1}{2}e^{-|x|}\) for
\(-\infty < x < \infty\).

a. The mean of the Laplace distribution is
clearly 0, by symmetry. Determine the
variance of the Laplace distribution.

b. Show that the mgf of the Laplace
distribution is \(M_X(t) = 1/(1 - t^2)\) for
\(-1 < t < 1\).

c. Now let \(Y_n = X_1 + \cdots + X_n\), where the
\(X_i\) are iid Laplace rvs. Determine the
mean, variance, and mgf of \(Y_n\).

d. Define a standardized version of \(Y_n\) by
\(Z_n = (Y_n - \mu_{Y_n})/\sigma_{Y_n}\). Determine the mgf
of \(Z_n\).

e. Show that as \(n \to \infty\), the limiting mgf
of \(Z_n\) is \(e^{t^2/2}\), the mgf of a standard
normal rv.
(This is a preview of the celebrated Central
Limit Theorem, which we'll encounter in
Chapter 6.)


5.4 Conditional Distributions and Conditional Expectation 


The distribution of Y can depend strongly on the value of another variable X. For example, if X is 
height and Y is weight, the weight distribution for men who are 6 ft tall is very different from the 
weight distribution for short men. The conditional distribution of Y given X = x describes for each 
possible x how probability is distributed over the set of possible y values. We define the conditional 
distribution of Y given X, but the conditional distribution of X given Y can be obtained by just 
reversing the roles of X and Y. Both definitions are analogous to that of the conditional probability 
P(A|B) as the ratio \(P(A \cap B)/P(B)\).


DEFINITION Let X and Y be two discrete random variables with joint pmf p(x, y) and marginal
X pmf \(p_X(x)\). Then for any x value such that \(p_X(x) > 0\), the conditional probability
mass function of Y given X = x is

\[ p_{Y|X}(y \mid x) = \frac{p(x, y)}{p_X(x)} \]

An analogous formula holds in the continuous case. Let X and Y be two continuous
random variables with joint pdf f(x, y) and marginal X pdf \(f_X(x)\). Then for any x value
such that \(f_X(x) > 0\), the conditional probability density function of Y given X = x is

\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} \]


Example 5.24 For a discrete example, reconsider Example 5.1, where X represents the deductible 
amount on an automobile policy and Y represents the deductible amount on a homeowner’s policy. 
Here is the joint distribution again.

                     y
p(x, y)        0      100     200
x    100      .20     .10     .20
     250      .05     .15     .30


The distribution of Y depends on X. In particular, let’s find the conditional probability that Y is 200, 
given that X is 250, using the definition of conditional probability from Section 2.4: 


\[ P(Y = 200 \mid X = 250) = \frac{P(Y = 200 \cap X = 250)}{P(X = 250)} = \frac{.3}{.05 + .15 + .3} = .6 \]

With our new definition we obtain the same result:

\[ p_{Y|X}(200 \mid 250) = \frac{p(250, 200)}{p_X(250)} = \frac{.3}{.05 + .15 + .3} = .6 \]


The conditional probabilities for the two other possible values of Y are 


\[ p_{Y|X}(0 \mid 250) = \frac{p(250, 0)}{p_X(250)} = \frac{.05}{.05 + .15 + .3} = .1 \]

\[ p_{Y|X}(100 \mid 250) = \frac{p(250, 100)}{p_X(250)} = \frac{.15}{.05 + .15 + .3} = .3 \]


Thus, \(p_{Y|X}(0 \mid 250) + p_{Y|X}(100 \mid 250) + p_{Y|X}(200 \mid 250) = .1 + .3 + .6 = 1\). This is no coincidence;
conditional probabilities satisfy the properties of ordinary probabilities. They are nonnegative and 
they sum to 1. Essentially, the denominator in the definition of conditional probability is designed to 
make the total be 1. 

Reversing the roles of X and Y, we find the conditional probabilities for X, given that Y = 0: 


\[ p_{X|Y}(100 \mid 0) = \frac{p(100, 0)}{p_Y(0)} = \frac{.20}{.20 + .05} = .8 \]

\[ p_{X|Y}(250 \mid 0) = \frac{p(250, 0)}{p_Y(0)} = \frac{.05}{.20 + .05} = .2 \]

Again, the conditional probabilities add to 1. ■
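The calculations in Example 5.24 amount to dividing a row or column of the joint table by its total. The sketch below illustrates that recipe; the dictionary layout and the helper name `conditional_pmf` are our own choices, not notation from the text.

```python
# Joint pmf p(x, y) from Example 5.24, keyed by (x, y) =
# (auto deductible, homeowner's deductible)
joint = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}

def conditional_pmf(joint, given_index, given_value):
    """Conditional pmf of the other variable, given that coordinate
    `given_index` (0 for X, 1 for Y) equals `given_value`."""
    other = 1 - given_index
    cells = {k: p for k, p in joint.items() if k[given_index] == given_value}
    marginal = sum(cells.values())      # p_X(x) or p_Y(y)
    if marginal <= 0:
        raise ValueError("conditioning value has zero probability")
    return {k[other]: p / marginal for k, p in cells.items()}

y_given_x250 = conditional_pmf(joint, 0, 250)   # pmf of Y given X = 250
x_given_y0 = conditional_pmf(joint, 1, 0)       # pmf of X given Y = 0

print(y_given_x250)
print(x_given_y0)
```

The guard against a zero marginal mirrors the requirement in the definition that the conditioning value have positive probability.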


Example 5.25 For a continuous example, recall Example 5.5, where X is the weight of almonds and 
Y is the weight of cashews in a can of mixed nuts. The sum of X + Y is at most one pound, the total 
weight of the can of nuts. The joint pdf of X and Y is 


\[ f(x, y) = 24xy \qquad 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \]

In Example 5.5 it was shown that

\[ f_X(x) = 12x(1 - x)^2 \qquad 0 \le x \le 1 \]

The conditional pdf of Y given that X = x is

\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{24xy}{12x(1 - x)^2} = \frac{2y}{(1 - x)^2} \qquad 0 \le y \le 1 - x \]


This can be used to calculate conditional probabilities for Y. For example, 


\[ P(Y \le .25 \mid X = .5) = \int_0^{.25} f_{Y|X}(y \mid .5)\,dy = \int_0^{.25} \frac{2y}{(1 - .5)^2}\,dy = \left[4y^2\right]_0^{.25} = .25 \]
Given that the weight of almonds (X) is .5 lb, the probability is .25 for the weight of cashews (Y) to be 
less than .25 lb. 
Just as in the discrete case, the conditional distribution assigns a total probability of 1 to the set of
all possible Y values. That is, integrating the conditional density over its set of possible values should
yield 1:

\[ \int_{-\infty}^{\infty} f_{Y|X}(y \mid x)\,dy = \int_0^{1-x} \frac{2y}{(1 - x)^2}\,dy = \left[\frac{y^2}{(1 - x)^2}\right]_0^{1-x} = 1 \]

Whenever you calculate a conditional density, we recommend doing this integration as a validity
check. ■
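The validity check recommended in Example 5.25 can also be carried out numerically. The sketch below uses a plain midpoint rule rather than any particular numerical library; the function names `f_y_given_x` and `integrate` are ours.

```python
def f_y_given_x(y, x):
    """Conditional pdf 2y/(1-x)^2 on 0 <= y <= 1-x, and 0 elsewhere."""
    return 2 * y / (1 - x) ** 2 if 0 <= y <= 1 - x else 0.0

def integrate(g, a, b, n=10_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

x = 0.5
# P(Y <= .25 | X = .5): integrate the conditional pdf from 0 to .25
prob = integrate(lambda y: f_y_given_x(y, x), 0, 0.25)
# Validity check: the conditional pdf should integrate to 1 over [0, 1 - x]
total = integrate(lambda y: f_y_given_x(y, x), 0, 1 - x)

print(round(prob, 6), round(total, 6))
```

Because the integrand is linear in y, the midpoint rule is exact here apart from floating-point rounding.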


Conditional Distributions and Independence 

Recall that in Section 5.1 two random variables were defined to be independent if their joint pmf or 
pdf factors into the product of the marginal pmfs or pdfs. We can understand this definition better with 
the help of conditional distributions. For example, suppose there is independence in the discrete case. 
Then 


\[ p_{Y|X}(y \mid x) = \frac{p(x, y)}{p_X(x)} = \frac{p_X(x)\,p_Y(y)}{p_X(x)} = p_Y(y) \]

That is, independence implies that the conditional distribution of Y is the same as the unconditional
(i.e., marginal) distribution, and that this is true no matter the value of X. The implication works in the
other direction, too. If \(p_{Y|X}(y \mid x) = p_Y(y)\), then

\[ \frac{p(x, y)}{p_X(x)} = p_Y(y) \]

so \(p(x, y) = p_X(x) \cdot p_Y(y)\), and therefore X and Y are independent.


In Example 5.7 we said that independence necessitates the region of positive density being a 
rectangle (possibly infinite in extent). In terms of conditional distributions, this region tells us the 
domain of Y for each possible x value. For independence we need to have the domain of Y (the 
interval of positive density) be the same for each x, implying a rectangular region.


Conditional Expectation and Variance

Because the conditional distribution is a valid probability distribution, it makes sense to define the 
conditional mean and variance. 


DEFINITION Let X and Y be two discrete random variables with conditional probability mass
function \(p_{Y|X}(y \mid x)\). Then the conditional expectation (or conditional mean) of
Y given X = x is

\[ \mu_{Y|X=x} = E(Y \mid X = x) = \sum_y y \cdot p_{Y|X}(y \mid x) \]

Analogously, for two continuous rvs X and Y with conditional probability density
function \(f_{Y|X}(y \mid x)\),

\[ \mu_{Y|X=x} = E(Y \mid X = x) = \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y \mid x)\,dy \]

More generally, the conditional mean of any function h(Y) is given by

\[
E(h(Y) \mid X = x) =
\begin{cases}
\sum_y h(y) \cdot p_{Y|X}(y \mid x) & \text{(discrete case)} \\
\int_{-\infty}^{\infty} h(y) \cdot f_{Y|X}(y \mid x)\,dy & \text{(continuous case)}
\end{cases}
\]

In particular, the conditional variance of Y given X = x is

\[ \sigma^2_{Y|X=x} = V(Y \mid X = x) = E[(Y - \mu_{Y|X=x})^2 \mid X = x] = E(Y^2 \mid X = x) - \mu^2_{Y|X=x} \]


Example 5.26 Having previously found the conditional distribution of Y given X = 250 in Example 
5.24, let’s compute the conditional mean and variance. 


\[
\begin{aligned}
\mu_{Y|X=250} = E(Y \mid X = 250) &= 0 \cdot p_{Y|X}(0 \mid 250) + 100 \cdot p_{Y|X}(100 \mid 250) + 200 \cdot p_{Y|X}(200 \mid 250) \\
&= 0(.1) + 100(.3) + 200(.6) = 150
\end{aligned}
\]


The average homeowner’s policy deductible, among customers with a $250 auto deductible, is $150. 
Given that the possibilities for Y are 0, 100, and 200 and most of the probability is on the latter two 
values, it is reasonable that the conditional mean should be between 100 and 200. 

Using the alternative (shortcut) formula for the conditional variance requires first obtaining the
conditional expectation of \(Y^2\):

\[
\begin{aligned}
E(Y^2 \mid X = 250) &= 0^2 \cdot p_{Y|X}(0 \mid 250) + 100^2 \cdot p_{Y|X}(100 \mid 250) + 200^2 \cdot p_{Y|X}(200 \mid 250) \\
&= 0^2(.1) + 100^2(.3) + 200^2(.6) = 27{,}000
\end{aligned}
\]

Thus,

\[ \sigma^2_{Y|X=250} = V(Y \mid X = 250) = E(Y^2 \mid X = 250) - \mu^2_{Y|X=250} = 27{,}000 - 150^2 = 4500 \]

Taking the square root gives \(\sigma_{Y|X=250}\) = $67.08, which is in the right ballpark when we recall that the
possible values of Y are 0, 100, and 200. ■
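The arithmetic of Example 5.26 can be reproduced in a few lines. This is a minimal sketch that starts from the conditional pmf found in Example 5.24; the variable names are ours.

```python
from math import sqrt

# Conditional pmf of Y given X = 250, from Example 5.24
cond_pmf = {0: 0.1, 100: 0.3, 200: 0.6}

# Conditional mean: sum of y * p(y | x)
cond_mean = sum(y * p for y, p in cond_pmf.items())

# Shortcut formula: V(Y | X = x) = E(Y^2 | X = x) - [E(Y | X = x)]^2
cond_second_moment = sum(y ** 2 * p for y, p in cond_pmf.items())
cond_var = cond_second_moment - cond_mean ** 2
cond_sd = sqrt(cond_var)

print(cond_mean, cond_second_moment, cond_var, round(cond_sd, 2))
```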


Example 5.27 (Example 5.25 continued) Suppose a 1-lb can of mixed nuts contains .1 lbs of
almonds (i.e., we know that X = .1). Given this information, the amount of cashews Y in the can is
constrained by 0 < y < 1 - x = .9, and the expected amount of cashews in such a can is

\[ E(Y \mid X = .1) = \int_0^{.9} y \cdot f_{Y|X}(y \mid .1)\,dy = \int_0^{.9} y \cdot \frac{2y}{(1 - .1)^2}\,dy = .6 \]

The conditional variance of Y given that X = .1 is

\[ V(Y \mid X = .1) = \int_0^{.9} (y - .6)^2 \cdot f_{Y|X}(y \mid .1)\,dy = \int_0^{.9} (y - .6)^2 \cdot \frac{2y}{(1 - .1)^2}\,dy = .045 \]

Using the aforementioned shortcut, this can also be calculated in two steps:

\[ E(Y^2 \mid X = .1) = \int_0^{.9} y^2 \cdot f_{Y|X}(y \mid .1)\,dy = \int_0^{.9} y^2 \cdot \frac{2y}{(1 - .1)^2}\,dy = .405 \]

\[ \Rightarrow V(Y \mid X = .1) = .405 - (.6)^2 = .045 \]

More generally, conditional on X = x lbs (where 0 < x < 1), integrals similar to those above can be
used to show that the conditional mean amount of cashews is 2(1 - x)/3, and the corresponding
conditional variance is \((1 - x)^2/18\). This formula implies that the variance gets smaller as the weight of
almonds (x) in a can approaches 1 lb. Does this make sense? When the weight of almonds is 1 lb, the
weight of cashews is guaranteed to be 0, implying that the variance is 0. Indeed, Figure 5.2 shows
that the set of possible y values narrows to 0 as x approaches 1. ■
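The general formulas quoted at the end of Example 5.27 can be spot-checked numerically. The sketch below reuses a simple midpoint rule (our own helper, not a library routine) and compares the results with 2(1 - x)/3 and (1 - x)^2/18 at a few x values chosen arbitrarily.

```python
def integrate(g, a, b, n=20_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def conditional_moments(x):
    """Conditional mean and variance of Y given X = x for the
    conditional pdf 2y/(1-x)^2 on 0 <= y <= 1-x."""
    upper = 1 - x
    pdf = lambda y: 2 * y / upper ** 2
    mean = integrate(lambda y: y * pdf(y), 0, upper)
    second_moment = integrate(lambda y: y * y * pdf(y), 0, upper)
    return mean, second_moment - mean ** 2   # shortcut formula for the variance

mean_01, var_01 = conditional_moments(0.1)
print(round(mean_01, 4), round(var_01, 4))

# Compare with the closed forms at a few illustrative x values
for x in (0.3, 0.7, 0.95):
    m, v = conditional_moments(x)
    print(x, round(m, 4), round(2 * (1 - x) / 3, 4), round(v, 5), round((1 - x) ** 2 / 18, 5))
```

The shrinking variance as x approaches 1 shows up directly in the last two columns of the printed comparison.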


The Laws of Total Expectation and Variance 

By the definition of conditional expectation, the rv Y has a conditional mean for every possible value 
x of the variable X. In Example 5.26, we determined the mean of Y given that X = 250, but a different 
mean would result if we conditioned on X = 100. For the continuous rvs in Example 5.27, every
value x between 0 and 1 yielded a different conditional mean of Y (and, in fact, we even found a
general formula for this conditional expectation). As it turns out, these conditional means can be
related back to the unconditional mean of Y, i.e., \(\mu_Y\). Our next example illustrates the connection.


Example 5.28 Apartments in a certain city have x = 0, 1, 2, or 3 bedrooms (0 for a studio apartment),
and y = 1, 1.5, or 2 bathrooms. The accompanying table gives the proportions of apartments
for the various number of bedroom/number of bathroom combinations.


                        y
p(x, y)        1      1.5      2      p_X(x)
x    0        .10     .00     .00      .1
     1        .20     .08     .02      .3
     2        .15     .10     .15      .4
     3        .05     .05     .10      .2
p_Y(y)        .50     .23     .27


Let X and Y denote the number of bedrooms and bathrooms, respectively, in a randomly selected 
apartment in this city. The marginal distribution of Y comes from the column totals in the joint 
probability table, from which it is easily verified that E(Y) = 1.385 and V(Y) = .179275. The con- 
ditional distributions (pmfs) of Y given that X = x for x = 0, 1, 2, and 3 are as follows: 


x = 0: \(p_{Y|X=0}(1) = 1\) (all studio apartments have one bathroom)
x = 1: \(p_{Y|X=1}(1) = .667\), \(p_{Y|X=1}(1.5) = .267\), \(p_{Y|X=1}(2) = .067\)
x = 2: \(p_{Y|X=2}(1) = .375\), \(p_{Y|X=2}(1.5) = .25\), \(p_{Y|X=2}(2) = .375\)
x = 3: \(p_{Y|X=3}(1) = .25\), \(p_{Y|X=3}(1.5) = .25\), \(p_{Y|X=3}(2) = .50\)


From these conditional pmfs, we obtain the expected value of Y given X = x for each of the four 
possible x values: 


E(Y|X = 0) = 1, E(Y|X = 1) = 1.2, E(Y|X = 2) = 1.5, E(Y|X = 3) = 1.625


So, on the average, studio apartments have 1 bathroom, one-bedroom apartments have 1.2 bathrooms,
2-bedroom apartments have 1.5 baths, and luxurious 3-bedroom apartments have 1.625 baths.

Now, instead of writing E(Y|X = x) for some specific value x, let’s consider the expected number 
of bathrooms for an apartment of randomly selected size, X. This expectation, denoted E(Y|X), is itself 
a random variable, since it is a function of the random quantity X. Its smallest possible value is 1, 
which occurs when X = 0, and that happens with probability .1 (the sum of probabilities in the first 
row of the joint probability table). Similarly, the random variable E(Y|X) takes on the value 1.2 with 
probability px(1) = .3. Continuing in this manner, the probability distribution of the rv E(Y|X) is as 
follows: 


Value of E(Y|X)         1      1.2     1.5     1.625
Probability of value   .1      .3      .4      .2


The expected value of this random variable, denoted E[E(Y|X)], is computed by taking the
weighted average of the four values of E(Y|X = x) against the probabilities specified by \(p_X(x)\), as
suggested by the preceding table:

\[ E[E(Y \mid X)] = 1(.1) + 1.2(.3) + 1.5(.4) + 1.625(.2) = 1.385 \]

But this is exactly E(Y), the expected number of bathrooms. ■
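The agreement found in Example 5.28 can be verified directly from the joint table. The following sketch computes E(Y) two ways, once from the joint pmf and once by iterated expectation; the dictionary layout and variable names are our own.

```python
# Joint pmf p(x, y) from Example 5.28, keyed by (bedrooms, bathrooms)
joint = {
    (0, 1): 0.10, (0, 1.5): 0.00, (0, 2): 0.00,
    (1, 1): 0.20, (1, 1.5): 0.08, (1, 2): 0.02,
    (2, 1): 0.15, (2, 1.5): 0.10, (2, 2): 0.15,
    (3, 1): 0.05, (3, 1.5): 0.05, (3, 2): 0.10,
}

# Marginal pmf of X: the row totals of the table
p_X = {}
for (x, _), p in joint.items():
    p_X[x] = p_X.get(x, 0.0) + p

# Conditional means E(Y | X = x) = sum_y y * p(x, y) / p_X(x)
cond_mean = {
    x: sum(y * p for (xi, y), p in joint.items() if xi == x) / p_X[x]
    for x in p_X
}

# Iterated expectation: weight each conditional mean by p_X(x)
iterated = sum(cond_mean[x] * p_X[x] for x in p_X)
# Direct computation of E(Y) from the joint pmf
direct = sum(y * p for (_, y), p in joint.items())

print(cond_mean)
print(round(iterated, 4), round(direct, 4))
```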


LAW OF TOTAL EXPECTATION For any two random variables X and Y,

\[ E[E(Y \mid X)] = E(Y) \]

(This is sometimes referred to as computing E(Y) by means of
iterated expectation.)


The Law of Total Expectation says that E(Y) is a weighted average of the conditional means 
E(Y|X = x), where the weights are given by the pmf or pdf of X. It is analogous to the Law of 
Total Probability, which describes how to find P(B) as a weighted average of conditional probabilities 
P(B|Aᵢ).


Proof Here is the proof when both rvs are discrete; in the jointly continuous case, simply replace 
summation by integration and pmfs by pdfs. 


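\[
\begin{aligned}
E[E(Y|X)] &= \sum_{x \in D_X} E(Y|X = x)\, p_X(x) = \sum_{x \in D_X} \sum_{y \in D_Y} y\, p_{Y|X}(y|x)\, p_X(x) \\
&= \sum_{x \in D_X} \sum_{y \in D_Y} y\, \frac{p(x, y)}{p_X(x)}\, p_X(x) = \sum_{y \in D_Y} y \sum_{x \in D_X} p(x, y) = \sum_{y \in D_Y} y\, p_Y(y) = E(Y)
\end{aligned}
\]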


In Example 5.28, the use of iterated expectation to compute E(Y) is unnecessarily cumbersome; 
working from the marginal pmf of Y is more straightforward. However, there are many situations in 
which the distribution of a variable Y is only expressed conditional on the value of another variable 
X. For these so-called hierarchical models, the Law of Total Expectation proves very useful. 


Example 5.29 A ferry goes from the left bank of a small river to the right bank once an hour. The 
ferry can accommodate at most two vehicles. The probability that no vehicles show up is .1, that 
exactly one shows up is .7, and that two or more show up is .2 (but only two can be transported). The 
fare paid for a vehicle depends upon its weight, and the average fare per vehicle is $25. What is the 
expected fare for a single trip made by this ferry? 

Let X represent the number of vehicles that show up, and let Y denote the total fare for a single 
trip. The conditional mean of Y, given X, is E(Y | X) = 25X. So, by the Law of Total Expectation, 


\[
E(Y) = E[E(Y|X)] = E[25X] = \sum_{x} 25x \cdot p_X(x) = (0)(.1) + (25)(.7) + (50)(.2) = \$27.50
\]


Now consider computing the variance of Y by conditioning on the value of X. There are two 
contributions to V(Y). The first part is the variance of the random variable E(Y|X). The second part 
involves the random variable V(Y|X)—the variance of Y as a function of X—and in particular the 
expected value of this random variable. 

LAW OF TOTAL VARIANCE For any two random variables X and Y,


V(Y) = V[E(Y|X)] + E[V(Y|X)]


Proving the Law of Total Variance requires some slightly clever algebra; see Exercise 90. 


Example 5.30 Let’s verify the Law of Total Variance for the apartment scenario of Example 5.28. 
The pmf of the rv E(Y|X) appears in that example, from which its variance is given by 


V[E(Y|X)] = (1 − 1.385)²(.1) + (1.2 − 1.385)²(.3) + (1.5 − 1.385)²(.4) + (1.625 − 1.385)²(.2)
= .0419


Recall that 1.385 is the mean of the rv E(Y|X), which, by the Law of Total Expectation, is also E(Y). 
The second term in the Law of Total Variance involves the variable V(Y|X), which requires deter- 
mining the conditional variance of Y given X =x for x =0, 1, 2, 3. Using the four conditional 
distributions displayed in Example 5.28, these are 


V(Y|X = 0) = 0;  V(Y|X = 1) = .0933;  V(Y|X = 2) = .1875;  V(Y|X = 3) = .171875


The rv V(Y|X) takes on these four values with probabilities .1, .3, .4, and .2, respectively (again, these
are inherited from the distribution of X). Thus,


E[V(Y|X)] = 0(.1) + .0933(.3) + .1875(.4) + .171875(.2) = .137375


Combining, V[E(Y|X)] + E[V(Y | X)] = .0419 + .137375 = .179275. This is exactly V(Y) computed 
using the marginal pmf of Y in Example 5.28, and the Law of Total Variance is verified for this 
example.
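
The same verification can be scripted. The following Python sketch is only a check under stated assumptions: it takes the conditional means and variances exactly as printed above (so .0933 is a rounded input), and the variable names are ours rather than the text's.

```python
# Distribution of X and the conditional summaries of Y given X = x for x = 0, 1, 2, 3,
# as printed in the example (.0933 is rounded, so totals match only to about 4 decimals).
p_x = [0.1, 0.3, 0.4, 0.2]
cond_mean = [1.0, 1.2, 1.5, 1.625]
cond_var = [0.0, 0.0933, 0.1875, 0.171875]

# E(Y) by the Law of Total Expectation.
mean_y = sum(m * p for m, p in zip(cond_mean, p_x))

# First piece: variance of the random variable E(Y|X).
var_of_cond_mean = sum((m - mean_y) ** 2 * p for m, p in zip(cond_mean, p_x))

# Second piece: expected value of the random variable V(Y|X).
mean_of_cond_var = sum(v * p for v, p in zip(cond_var, p_x))

total_var = var_of_cond_mean + mean_of_cond_var
print(round(var_of_cond_mean, 4), round(mean_of_cond_var, 4), round(total_var, 4))
```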


The computation of V(Y) in Example 5.30 is clearly not efficient; it is much easier, given the joint 
pmf of X and Y, to determine the variance of Y from its marginal pmf. As with the Law of Total 
Expectation, the real worth of the Law of Total Variance comes from its application to hierarchical 
models, where the distribution of one variable (Y, say) is only known conditional on the distribution 
of another rv. 


Example 5.31 In the manufacture of ceramic tiles used for heat shielding, the proportion of tiles that 
meet the required thermal specifications varies from day to day. Let P denote the proportion of tiles 
meeting specifications on a randomly selected day, and suppose P can be modeled by the following 
pdf: 


f(p) = 9p⁸,  0 < p < 1


At the end of each day, a random sample of n = 20 tiles is selected and each tile is tested. Let Y denote 
the number of tiles among the 20 that meet specifications; conditional on P = p, Y ~ Bin(20, p). Find 
the expected number of tiles meeting thermal specifications in a daily sample of 20, and find the 
corresponding standard deviation. 

From the properties of the binomial distribution, we know that E(Y|P = p) = np = 20p, so 
E(Y|P) = 20P. Applying the Law of Total Expectation, 

\[
E(Y) = E[E(Y|P)] = E[20P] = \int_0^1 20p \cdot f(p)\, dp = \int_0^1 180p^9\, dp = 18
\]


This is reasonable: since E(P) = .9 by integration, the expected proportion of good tiles is 90%, and 
thus the expected number of good tiles in a random sample of 20 tiles is 18. 

Determining the standard deviation of Y requires the two pieces of the Law of Total Variance. 
First, using the rescaling property of variance, 


V[E(Y|P)] = V(20P) = 20²V(P) = 400V(P)


The variance of P can be determined directly from the pdf of P via integration. The result is 
V(P) = 9/1100, so V[E(Y|P)] = 400(9/1100) = 36/11. Second, the binomial variance formula 
np(1 − p) implies that the conditional variance of Y given P is V(Y|P) = 20P(1 − P), so


\[
E[V(Y|P)] = E[20P(1 - P)] = \int_0^1 20p(1 - p) \cdot 9p^8\, dp = \frac{18}{11}
\]

Therefore, by the Law of Total Variance,

\[
V(Y) = V[E(Y|P)] + E[V(Y|P)] = \frac{36}{11} + \frac{18}{11} = \frac{54}{11} = 4.909,
\]


and the standard deviation of Y is σY = √4.909 = 2.22. This “total” standard deviation accounts for
two effects: day-to-day variation in quality as modeled by P (the first term in the variance expression),
and random variation in the number of observed good tiles as modeled by the binomial distribution
(the second term).
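
For readers who want to confirm the fractions in this example, here is a minimal Python sketch using exact rational arithmetic. It relies on the fact that, for the pdf f(p) = 9p⁸ on (0, 1), the kth moment is E(Pᵏ) = 9/(9 + k); the helper name `moment_of_p` is our own and hypothetical.

```python
from fractions import Fraction

def moment_of_p(k):
    """E(P^k) for the pdf f(p) = 9p^8 on (0, 1): the integral of 9p^(8+k) is 9/(9+k)."""
    return Fraction(9, 9 + k)

n = 20                                  # daily sample size
e_p = moment_of_p(1)                    # E(P)
v_p = moment_of_p(2) - e_p ** 2         # V(P) via the shortcut formula

e_y = n * e_p                                    # Law of Total Expectation: E(20P)
var_of_cond_mean = n ** 2 * v_p                  # V[E(Y|P)] = 400 V(P)
mean_of_cond_var = n * (e_p - moment_of_p(2))    # E[V(Y|P)] = E[20P(1 - P)]
v_y = var_of_cond_mean + mean_of_cond_var        # Law of Total Variance
sd_y = float(v_y) ** 0.5

print(e_y, v_p, var_of_cond_mean, mean_of_cond_var, v_y, round(sd_y, 2))
```

Working with `Fraction` avoids rounding, so each intermediate value can be compared term by term with the fractions in the example.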


Here is an example where the Laws of Total Expectation and Variance are helpful in finding the 
mean and variance of a random variable that is neither discrete nor continuous. 


Example 5.32 The probability of a claim being filed on an insurance policy is .1, and only one claim 
can be filed. If a claim is filed, the claim amount is exponentially distributed with mean $1000. Recall 
from Section 3.4 that μ = σ for an exponential rv, so the variance is the square of this value. We want
to find the mean and variance of the amount paid. Let X be the number of claims (0 or 1) and let
Y be the payment. We know that E(Y|X = 0) = 0 and E(Y|X = 1) = 1000. Also, V(Y|X = 0) = 0 and
V(Y|X = 1) = 1000² = 1,000,000. Here is a table for the distribution of E(Y|X = x) and V(Y|X = x):


x    P(X = x)    E(Y|X = x)    V(Y|X = x)
0    .9          0             0
1    .1          1000          1,000,000


Therefore, 


E(Y) = E[E(Y|X)] = E(Y|X = 0) · P(X = 0) + E(Y|X = 1) · P(X = 1) = 0(.9) + 1000(.1) = 100


The average claim amount across all customers is $100. Next, the variance of the conditional mean is 

V[E(Y|X)] = (0 − 100)²(.9) + (1000 − 100)²(.1) = 90,000,


and the expected value of the conditional variance is 


E[V(Y|X)] = 0(.9) + 1,000,000(.1) = 100,000


Now apply the Law of Total Variance to get V(Y): 


V(Y) = V[E(Y|X)] + E[V(Y|X)] = 90,000 + 100,000 = 190,000 


Taking the square root gives the standard deviation, σY = $435.89.

Suppose that we want to compute the mean and variance of Y directly. Notice that X is discrete, but 
the conditional distribution of Y given X = 1 is continuous. The random variable Y itself is neither 
discrete nor continuous, because it has probability .9 of being 0, but the other .1 of its probability is 
spread out from 0 to ∞. Such “mixed” distributions may require a little extra effort to evaluate means
and variances, although it is not especially hard in this case (because the discrete mass is at 0 and
doesn’t contribute to expectations):


\[
E(Y) = (.9)(0) + (.1)\int_0^\infty y \cdot \frac{1}{1000} e^{-y/1000}\, dy = (.1)(1000) = 100
\]

\[
E(Y^2) = (.9)(0^2) + (.1)\int_0^\infty y^2 \cdot \frac{1}{1000} e^{-y/1000}\, dy = (.1)(2{,}000{,}000) = 200{,}000
\]

\[
V(Y) = E(Y^2) - [E(Y)]^2 = 200{,}000 - 10{,}000 = 190{,}000
\]


These agree with what we found using the theorems. 
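
The two routes in this example are easy to compare in code. The sketch below is illustrative (the variable names are ours); it uses only the facts stated above, namely that a claim occurs with probability .1 and that an exponential claim amount with mean 1000 has variance 1000², hence second moment 2(1000²).

```python
import math

p_claim = 0.1                 # P(X = 1)
mean_claim = 1000.0           # E(Y | X = 1)
var_claim = mean_claim ** 2   # V(Y | X = 1) for an exponential rv

# Route 1: Laws of Total Expectation and Total Variance.
e_y = (1 - p_claim) * 0.0 + p_claim * mean_claim
var_of_cond_mean = (0.0 - e_y) ** 2 * (1 - p_claim) + (mean_claim - e_y) ** 2 * p_claim
mean_of_cond_var = (1 - p_claim) * 0.0 + p_claim * var_claim
v_y = var_of_cond_mean + mean_of_cond_var

# Route 2: direct moments of the mixed distribution.
# E(Y^2 | X = 1) = variance + mean^2 = 2,000,000 for this exponential rv.
e_y2 = p_claim * (var_claim + mean_claim ** 2)
v_y_direct = e_y2 - e_y ** 2

sd_y = math.sqrt(v_y)
print(e_y, v_y, v_y_direct, round(sd_y, 2))
```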


Exercises: Section 5.4 (68-90) 


68. According to the 2017 CIRP report The 
American Freshman, 36.2% of first-year 
college students identify as liberals, 22.4%
as conservatives, and 41.4% characterize 
themselves as middle-of-the-road. Choose 
two students at random, let X be the number 
of liberals among the two, and let Y be the 
number of conservatives among the two. 


a. Using the multinomial distribution from 
Section 5.1, give the joint probability 
mass function p(x, y) of X and Y and the 
corresponding joint probability table. 

b. Determine the marginal probability mass 
functions by summing p(x, y) numeri- 
cally. How could these be obtained 
directly? [Hint: What are the univariate 
distributions of X and Y?] 


c. Determine the conditional probability
mass function of Y given X = x for x = 0,
1, 2. Compare this to the binomial distribution
with n = 2 − x and p = .224/.638.
Why should this work?

d. Are X and Y independent? Explain.

e. Find E(Y|X = x) for x = 0, 1, 2. Do this
numerically and then compare with the
use of the formula for the binomial
mean, using the binomial distribution
given in part (c).

f. Determine V(Y|X = x) for x = 0, 1, 2. Do
this numerically and then compare with
the use of the formula for the binomial
variance, using the binomial distribution
given in part (c).

69. Teresa and Allison each have arrival times
uniformly distributed between 12:00 and
1:00. Their times do not influence each
other. If Y is the first of the two times and
X is the second, on a scale of 0-1, it can be
shown that the joint pdf of X and Y is
f(x, y) = 2 for 0 < y < x < 1.


a. Determine the marginal density of X. 

b. Determine the conditional density of 
Y given X = x. 

c. Determine the conditional probability 
that Y is between 0 and .3, given that
X is .5. 

d. Are X and Y independent? Explain. 

e. Determine the conditional mean of 
Y given X = x. 

f. Determine the conditional variance of 
Y given X = x. 


70. Refer back to the previous exercise.

a. Determine the marginal density of Y.

b. Determine the conditional density of
X given Y = y.

c. Determine the conditional mean of
X given Y = y.

d. Determine the conditional variance of
X given Y = y.


71. A pizza place has two phones. On each
phone the waiting time until the first call is
exponentially distributed with mean one
minute. Each phone is not influenced by the
other. Let X be the shorter of the two
waiting times and let Y be the longer. It can
be shown that the joint pdf of X and Y is
f(x, y) = 2e^{−(x+y)} for 0 < x < y < ∞.


a. Determine the marginal density of X. 

b. Determine the conditional density of 
Y given X = x. 

c. Determine the probability that Y is 
greater than 2, given that X = 1. 

d. Are X and Y independent? Explain. 

e. Determine the conditional mean of 
Y given X = x. 

f. Determine the conditional variance of 
Y given X = x. 


72. A class has 10 mathematics majors, 6
computer science majors, and 4 statistics 
majors. A committee of two is selected at 
random to work on a problem. Let X be the 
number of mathematics majors, and let Y be 
the number of computer science majors 
chosen. 


a. Determine the joint probability mass 
function p(x, y). This generalizes the 
hypergeometric distribution studied in 
Section 3.6. Give the joint probability 
table showing all nine values, of which 
three should be 0. 

b. Determine the marginal probability mass 
functions by summing numerically. 
How could these be obtained directly? 
[Hint: What are the univariate distribu- 
tions of X and Y?] 

c. Determine the conditional probability 
mass function of Y given X =x for 
x = 0, 1, 2. Compare with the hypergeometric
h(y; 2 − x, 6, 10) distribution.
Intuitively, why should this work?

d. Are X and Y independent? Explain.

e. Determine E(Y|X = x), x = 0, 1, 2. Do
this numerically and then compare with
the use of the formula for the hypergeometric
mean, using the hypergeometric
distribution given in part (c).

f. Determine V(Y|X = x), x = 0, 1, 2. Do
this numerically and then compare with
the use of the formula for the hypergeometric
variance, using the hypergeometric
distribution given in part (c).


73. A one-foot-long stick is broken at a point
X (measured from the left end) chosen
uniformly at random along its length. Then the
left part is broken at a point Y chosen
uniformly at random along its length. In other
words, X is uniformly distributed between
0 and 1 and, given X = x, Y is uniformly
distributed between 0 and x.


a. Determine E(Y|X = x) and then V(Y|X = x). 
b. Determine f(x, y) using fX(x) and fY|X(y|x).

c. Determine fY(y).

d. Use fY(y) from (c) to get E(Y) and V(Y).

e. Use (a) and the Laws of Total Expecta- 
tion and Variance to get E(Y) and V(Y). 


74. A system consisting of two components
will continue to operate only as long as
both components function. Suppose the
joint pdf of the lifetimes (months) of the
two components in a system is given by
f(x, y) = c[10 − (x + y)] for x > 0,
y > 0, x + y < 10.


a. If the first component functions for 
exactly 3 months, what is the probability 
that the second functions for more than 
2 months? 

b. Suppose the system will continue to 
work only as long as both components 
function. Among 20 of these systems 
that operate independently of each other, 
what is the probability that at least half 
work for more than 3 months? 


75. Refer back to Exercise 1 of this chapter.


a. Given that X = 1, determine the
conditional pmf of Y, that is, pY|X(0|1),
pY|X(1|1), and pY|X(2|1).

b. Given that two hoses are in use at the 
self-service island, what is the condi- 
tional pmf of the number of hoses in use 
on the full-service island? 

c. Use the result of part (b) to calculate the 
conditional probability P(Y < 1|X = 2). 

d. Given that two hoses are in use at the 
full-service island, what is the condi- 
tional pmf of the number in use at the 
self-service island? 


76. The joint pdf of pressures for right and left
front tires is given in Exercise 11. 


a. Determine the conditional pdf of Y given 
that X =x and the conditional pdf of 
X given that Y = y.

b. If the pressure in the right tire is found to
be 22 psi, what is the probability that the 
left tire has a pressure of at least 25 psi? 
Compare this to P(Y > 25). 

c. If the pressure in the right tire is found to 
be 22 psi, what is the expected pressure 
in the left tire, and what is the standard 
deviation of pressure in this tire? 


77. Suppose that X is uniformly distributed
between 0 and 1. Given X = x, Y is uniformly
distributed between 0 and x².


a. Determine E(Y|X = x) and then V(Y|X = x). 
b. Determine f(x, y) using fX(x) and
fY|X(y|x).

c. Determine fY(y).


78. Refer back to the previous exercise.


a. Use fy(y) from the previous exercise to 
get E(Y) and V(Y). 

b. Use part (a) of the previous exercise and 
the Laws of Total Expectation and 
Variance to get E(Y) and V(Y). 


79. David and Peter independently choose at
random a number from 1, 2, 3, with each 
possibility equally likely. Let X be the 
larger of the two numbers, and let Y be the 
smaller. 


a. Determine p(x, y).

b. Determine pX(x), x = 1, 2, 3.

c. Determine pY|X(y|x).

d. Determine E(Y|X = x) for x = 1, 2, 3.

e. Determine V(Y|X = x) for x = 1, 2, 3.


80. Refer back to the previous exercise. Find


a. E(X).

b. pY(y).

c. E(Y) using pY(y).

d. E(Y) using E(Y|X). 

e. E(X) + E(Y). Why does your answer 
make intuitive sense? 

81. Refer back to the previous two exercises.
Find

a. pX|Y(x|y).

b. E(X|Y = y) for y = 1, 2, 3. 

c. V(X|Y = y) for y = 1, 2, 3. 


82. Consider three ping-pong balls numbered
1, 2, and 3. Two balls are randomly selected
with replacement. If the sum of the two
resulting numbers exceeds 4, two balls are
again selected. This process continues until
the sum is at most 4. Let X and Y denote
the last two numbers selected. Possible
(X, Y) pairs are {(1, 1), (1, 2), (1, 3), (2, 1),
(2, 2), (3, 1)}.


a. Determine pX,Y(x, y).

b. Determine pY|X(y|x).

c. Determine E(Y|X = x). Is this a linear 
function of x? 

d. Determine E(X|Y = y). What special 
property of p(x, y) allows us to get this 
from (c)? 

e. Determine V(Y|X = x). 


83. Let X be a random digit (0, 1, 2, ..., 9 are
equally likely), and let Y be a random digit
not equal to X. That is, the nine digits other
than X are equally likely for Y.


a. Determine pX(x), pY|X(y|x), and pX,Y(x, y).
b. Determine a formula for E(Y|X = x). 


84. Consider the situation in Example 5.29, and
suppose further that the standard deviation
for fares per car is $4.


a. Find the variance of the rv E(Y|X).

b. Using Expression (5.6) from the previous
section, the conditional variance of
Y given X = x is 4²x = 16x. Determine
the mean of the rv V(Y|X).

c. Use the Law of Total Variance to find σY,
the unconditional standard deviation of Y.


85. This week the number X of claims coming
into an insurance office is Poisson with
mean 100. The probability that any
particular claim relates to automobile
insurance is .6, independent of any other
claim. If Y is the number of automobile
claims, then Y is binomial with X trials,
each with “success” probability .6.


a. Determine E(Y|X =x) and V(Y|X = x). 
b. Use part (a) to find E(Y). 
c. Use part (a) to find V(Y). 


86. In the previous exercise, show that the
distribution of Y is Poisson with mean 60. 
[You will need to recognize the Maclaurin 
series expansion for the exponential func- 
tion.] Use the knowledge that Y is Poisson 
with mean 60 to find E(Y) and V(Y). 


87. The heights of American men follow a
normal distribution with mean 70 in. and
standard deviation 3 in. Suppose that the
weight distribution (lbs) for men that are
x inches tall also has a normal distribution,
but with mean 4x − 104 and standard
deviation .3x − 17. Let Y denote the weight
of a randomly selected American man. Find
the (unconditional) mean and standard
deviation of Y.


88. A statistician is waiting behind one person to
check out at a store. The checkout time for
the first person, X, can be modeled by an
exponential distribution with some parameter
λ > 0. The statistician observes the first
person’s checkout time, x; being a statistician,
she surmises that her checkout time
Y will follow an exponential distribution
with mean x.


a. Determine E(Y|X = x) and V(Y|X = x). 

b. Use the Laws of Total Expectation and 
Variance to find E(Y) and V(Y). 

c. Write out the joint pdf of X and Y. [Hint: 
You have fX(x) and fY|X(y|x).] Then write
an integral expression for the marginal 
pdf of Y (from which, at least in theory, 
one could determine the mean and vari- 
ance of Y). What happens? 

89. In the game Plinko on the television game
show The Price is Right, contestants have
the opportunity to earn “chips” (flat, circular
disks) that can be dropped down a peg
board into slots labeled with cash amounts.
Every contestant is given one chip automatically
and can earn up to four more
chips by correctly guessing the prices of
certain small items. If we let p denote the
probability a contestant correctly guesses
the price of a prize, then the number of
chips a contestant earns, X, can be modeled
as X = 1 + N, where N ~ Bin(4, p).

a. Determine E(X) and V(X).

b. For each chip, the amount of money
won on the Plinko board has the following
distribution:

Value          $0     $100    $500    $1,000    $10,000
Probability    .39    .03     .11     .24       .23

Determine the mean and variance of the
winnings from a single chip.

c. Let Y denote the total winnings of a
randomly selected contestant. Using
results from the previous section, the
conditional mean and variance of Y,
given a player gets x chips, are μx and
σ²x, respectively, where μ and σ² are the
mean and variance for a single chip
computed in (b). Find expressions for
the (unconditional) mean and standard
deviation of Y. [Note: Your answers will
be functions of p.]

d. Evaluate your answers to part (c) for
p = 0, .5, and 1. Do these answers make
sense? Explain.


90. Let X and Y be any two random variables.

a. Show that E[V(Y|X)] = E[Y²] − E[(E(Y|X))²].
[Hint: Use the variance shortcut formula
and apply the Law of Total Expectation
to the first term.]

b. Show that V(E[Y|X]) = E[(E(Y|X))²] −
(E[Y])². [Hint: Use the variance shortcut
formula again; this time, apply the
Law of Total Expectation to the second
term.]

c. Combine the previous two results to
establish the Law of Total Variance.

5.5. The Bivariate Normal Distribution 


Perhaps the most useful joint distribution is the bivariate normal distribution. Although the formula 
may seem rather complicated, it is based on a simple quadratic expression in the standardized 
variables (subtract the mean and then divide by the standard deviation). The bivariate normal pdf is 


\[
f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho \left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 \right] \right\}
\]

for −∞ < x < ∞, −∞ < y < ∞. The notation used here for the five parameters reflects the roles they
play. Some careful integration shows that μ₁ and σ₁ are the mean and standard deviation, respectively,
of X; μ₂ and σ₂ are the mean and standard deviation of Y; and ρ is the correlation coefficient between
the two variables. The integration required to do bivariate normal probability calculations is quite
difficult. Computer code is available for calculating P(X < x, Y < y) approximately using numerical
integration, and some software packages (e.g., R, SAS, Stata) include this feature.

The density surface in three dimensions looks like a mountain with elliptical cross sections,
as shown in Figure 5.7a. The vertical cross sections are all proportional to normal densities. If we set
f(x, y) = c to investigate the contours (curves along which the density is constant), this amounts to
equating the exponent of the joint pdf to a constant. The contours are then concentric ellipses centered
at (x, y) = (μ₁, μ₂), as shown in Figure 5.7b.


Figure 5.7 (a) A graph of the bivariate normal pdf; (b) contours of the bivariate normal pdf


If ρ = 0, then the bivariate normal pdf simplifies to f(x, y) = fX(x) · fY(y), where X ~ N(μ₁, σ₁) and
Y ~ N(μ₂, σ₂). That is, X and Y have independent normal distributions. (In this case, the axes of the
elliptical contours are parallel to the coordinate axes, and the contours are circles when σ₁ = σ₂.)
Recall that in Section 5.2 we emphasized that independence of X and
Y implies ρ = 0 but, in general, ρ = 0 does not imply independence. However, we have just seen that
when X and Y are bivariate normal ρ = 0 does imply independence. Therefore, in the bivariate normal
case ρ = 0 if and only if the two rvs are independent.

Regardless of whether or not ρ = 0, the marginal distribution fX(x) is just a normal pdf with mean
μ₁ and standard deviation σ₁:

\[
f_X(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x-\mu_1)^2/(2\sigma_1^2)}
\]

The integration to show this [integrating f(x, y) on y from −∞ to ∞] is rather messy. Likewise, the
marginal distribution of Y is N(μ₂, σ₂). These two marginal pdfs are, in fact, just special cases of a much
stronger result (whose proof relies on some advanced matrix theory and will not be presented here).


THEOREM Random variables X and Y have a bivariate normal distribution if and only if every 
linear combination of X and Y is normal; i.e., the rv aX + bY + c has a normal
distribution for any constants a, b, c (except the case a = b = 0). 


Example 5.33 Many students applying for college take the SAT, which consists of math and verbal 
components (the latter is currently called evidence-based reading and writing). Let X and Y denote the 
math and verbal scores, respectively, for a randomly selected student. According to the College 
Board, the population of students taking the exam in 2017 had the following results: 


μ₁ = 527,  σ₁ = 107,  μ₂ = 533,  σ₂ = 100,  ρ = .77

Suppose that X and Y have approximately (because both X and Y are discrete) a bivariate normal
distribution. Let’s determine the probability that a student’s total score across these two components 
exceeds 1250, the minimum admission score for a particular university. 

Our goal is to calculate P(X + Y > 1250). Using the bivariate normal pdf, the desired probability 
is a daunting double integral: 


\[
\int_{-\infty}^{\infty} \int_{1250-y}^{\infty} \frac{1}{2\pi(107)(100)\sqrt{1-.77^2}} \exp\left\{ -\frac{[(x-527)/107]^2 - 2(.77)(x-527)(y-533)/[(107)(100)] + [(y-533)/100]^2}{2(1-.77^2)} \right\} dx\, dy
\]


This is not a practical way to solve this problem! Instead, recognize X + Y as a linear combination of 
X and Y; by the preceding theorem, X + Y has a normal distribution. The mean and variance of 
X + Y are calculated using the formulas from Section 5.3: 


E(X + Y) = E(X) + E(Y) = μ₁ + μ₂ = 527 + 533 = 1060
V(X + Y) = V(X) + V(Y) + 2Cov(X, Y)
         = σ₁² + σ₂² + 2ρσ₁σ₂ = 107² + 100² + 2(.77)(107)(100) = 37,927


Therefore, P(X + Y > 1250) = 1 − Φ((1250 − 1060)/√37,927) = 1 − Φ(.98) = .1635.


Suppose instead we wish to determine P(X < Y), the probability a student scores lower on math
than on reading. If we rewrite this probability as P(X − Y < 0), then we may apply the preceding
theorem to the linear combination X − Y. With E(X − Y) = −6 and V(X − Y) = 4971,


P(X < Y) = P(X − Y < 0) = Φ((0 − (−6))/√4971) = Φ(.09) = .5359
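
The linear-combination shortcut used in this example is straightforward to script. The sketch below is an illustration rather than part of the original example: the helper `linear_combination` is a hypothetical name, and it uses `statistics.NormalDist` from the Python standard library. Because the code does not round the z value to two decimals, its probabilities differ slightly in the third decimal place from the table-based answers above.

```python
from math import sqrt
from statistics import NormalDist

# SAT parameters from the example.
mu1, sigma1 = 527, 107   # math
mu2, sigma2 = 533, 100   # verbal
rho = 0.77

def linear_combination(a, b):
    """Normal distribution of aX + bY when (X, Y) is bivariate normal."""
    mean = a * mu1 + b * mu2
    var = a**2 * sigma1**2 + b**2 * sigma2**2 + 2 * a * b * rho * sigma1 * sigma2
    return NormalDist(mean, sqrt(var))

total = linear_combination(1, 1)    # X + Y
diff = linear_combination(1, -1)    # X - Y

p_total = 1 - total.cdf(1250)       # P(X + Y > 1250)
p_diff = diff.cdf(0)                # P(X < Y) = P(X - Y < 0)

print(total.mean, round(total.variance), round(p_total, 4))
print(diff.mean, round(diff.variance), round(p_diff, 4))
```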


Independent Normal Random Variables 

As alluded to earlier in this section, if X and Y are independent normal rvs then the joint distribution 
of X and Y is trivially bivariate normal (specifically with ρ = 0). In Section 5.3, we proved that any
linear combination of independent normal rvs is itself normally distributed, which comports with the 
earlier theorem in this section. In fact, we can generalize to the case of two linear combinations of 
independent normal rvs. 


PROPOSITION Let U and V be linear combinations of the independent normal rvs X₁, ..., Xₙ.
Then the joint distribution of U and V is bivariate normal. The converse is also 
true: if U and V have a bivariate normal distribution, then they can be expressed 
as linear combinations of independent normal rvs. 


The proof uses the methods of the next section together with a little matrix theory. 


Example 5.34 How can we simulate bivariate normal rvs with a specified correlation ρ? Let Z₁ and
Z₂ be independent standard normal rvs (which can be generated using software, or by applying the
Box-Muller method described in Exercise 107), and define two new variables


U = Z₁,  V = ρ · Z₁ + √(1 − ρ²) · Z₂


Then U and V are linear combinations of independent normal rvs, so their joint distribution is
bivariate normal by the preceding proposition. It can be shown (Exercise 129) that U and V each have
mean 0 and standard deviation 1, and Corr(U, V) = ρ.

Now suppose we wish to simulate from a bivariate normal distribution with an arbitrary set of
parameters μ₁, σ₁, μ₂, σ₂, and ρ. Define X and Y by


X = μ₁ + σ₁U = μ₁ + σ₁Z₁,  Y = μ₂ + σ₂V = μ₂ + σ₂(ρZ₁ + √(1 − ρ²) Z₂)   (5.7)


Since X and Y in Expression (5.7) are linear functions of U and V, it follows from Section 5.2 that
Corr(X, Y) = Corr(U, V) = ρ. Moreover, since μU = μV = 0 and σU = σV = 1, these linear
transformations give X and Y the desired means and standard deviations. So, to simulate a bivariate normal
distribution, create a pair of independent standard normal variates z₁ and z₂, and then apply the
formulas for X and Y in Expression (5.7). (Notice also that we’ve just proved the “converse” part of
the foregoing proposition.)
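
The recipe in Expression (5.7) translates directly into code. The following Python sketch is one possible implementation, not the authors': the function names, the sample size, the seed, and the choice of the SAT parameters from Example 5.33 as test values are all our own assumptions. It draws standard normal variates with `random.Random.gauss` from the standard library.

```python
import math
import random

def simulate_bivariate_normal(n, mu1, sigma1, mu2, sigma2, rho, seed=None):
    """Generate n (x, y) pairs using Expression (5.7)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)   # independent standard normal variates
        z2 = rng.gauss(0.0, 1.0)
        x = mu1 + sigma1 * z1
        y = mu2 + sigma2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        pairs.append((x, y))
    return pairs

def sample_correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Illustrative run with the SAT parameters of Example 5.33.
pairs = simulate_bivariate_normal(20000, 527, 107, 533, 100, 0.77, seed=1)
r = sample_correlation(pairs)
mean_x = sum(x for x, _ in pairs) / len(pairs)
print(round(mean_x, 1), round(r, 3))
```

With a sample this large, the sample mean of X and the sample correlation should land close to 527 and .77, respectively, though the exact values depend on the seed.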


Conditional Distributions of X and Y 
The conditional density of Y given X = x results from dividing the marginal density of X into f(x,y). 
The algebra is again tedious, but the result is fairly simple. 


PROPOSITION Let X and Y have a bivariate normal distribution. Then the conditional distribution
of Y, given X = x, is normal with mean and variance

\[
\mu_{Y|X=x} = E(Y|X = x) = \mu_2 + \rho\sigma_2\, \frac{x - \mu_1}{\sigma_1}
\]

\[
\sigma^2_{Y|X=x} = V(Y|X = x) = \sigma_2^2 (1 - \rho^2)
\]


Notice that the conditional mean of Y is a linear function of x, and the conditional variance of
Y doesn’t depend on x at all. When ρ = 0, the conditional mean is the mean of Y, μ₂, and the
conditional variance is just the variance of Y, σ₂². In other words, if ρ = 0, then the conditional
distribution of Y is the same as the unconditional distribution of Y. When ρ is close to 1 or −1 the
conditional variance will be much smaller than V(Y), which says that knowledge of X will be very
helpful in predicting Y. If ρ is near 0 then X and Y are nearly independent and knowledge of X is not
very useful in predicting Y.


Example 5.35 Let X and Y be the heights of a randomly selected mother and her daughter,
respectively. A similar situation was one of the first applications of the bivariate normal distribution,
by Francis Galton in 1886, and the data was found to fit the distribution very well. Suppose X and Y have
a bivariate normal distribution with mean μ₁ = 64 in. and standard deviation σ₁ = 3 in. for X and mean μ₂ = 65
in. and standard deviation σ₂ = 3 in. for Y. Here μ₂ > μ₁, which is in accord with the increase in
height from one generation to the next. Assume ρ = .4. Then

\[
\mu_{Y|X=x} = \mu_2 + \rho\sigma_2\, \frac{x - \mu_1}{\sigma_1} = 65 + .4(3)\, \frac{x - 64}{3} = 65 + .4(x - 64) = .4x + 39.4
\]

\[
\sigma^2_{Y|X=x} = V(Y|X = x) = \sigma_2^2(1 - \rho^2) = 9(1 - .4^2) = 7.56 \quad \text{and} \quad \sigma_{Y|X=x} = 2.75.
\]

Notice that the conditional variance is 16% less than the variance of Y. Squaring the correlation gives
the proportion by which the conditional variance is reduced relative to the variance of Y.
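
The conditional mean and variance formulas are simple enough to wrap in a small function. The Python sketch below is illustrative: `conditional_y_given_x` is a hypothetical helper name, and the call uses the mother-daughter parameters of this example with a mother who is 69 in. tall (5 in. above the mean for mothers).

```python
import math

def conditional_y_given_x(x, mu1, sigma1, mu2, sigma2, rho):
    """Mean and variance of Y given X = x for a bivariate normal pair (X, Y)."""
    mean = mu2 + rho * sigma2 * (x - mu1) / sigma1
    var = sigma2**2 * (1 - rho**2)
    return mean, var

# Mother's height 69 in.; parameters from the example.
mean, var = conditional_y_given_x(69, mu1=64, sigma1=3, mu2=65, sigma2=3, rho=0.4)
sd = math.sqrt(var)
print(round(mean, 2), round(var, 2), round(sd, 2))
```

The conditional variance returned does not depend on the value of x supplied, in line with the remark following the proposition.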


Regression to the Mean 
The formula for the conditional mean can be re-expressed as 


\[
\frac{\mu_{Y|X=x} - \mu_2}{\sigma_2} = \rho \cdot \frac{x - \mu_1}{\sigma_1}
\]


In words, when the formula is expressed in terms of standardized variables, the standardized
conditional mean is just ρ times the standardized x. In particular, for the height scenario

\[
\frac{\mu_{Y|X=x} - 65}{3} = .4 \cdot \frac{x - 64}{3}
\]


If the mother is 5 in. above the mean of 64 in. for mothers, then the daughter’s conditional expected 
height is just 2 in. above the mean for daughters. In this example, with equal standard deviations for 
Y and X, the daughter’s conditional expected height is always closer to its mean than the mother’s 
height is to its mean. One can think of the conditional expectation as falling back toward the mean, 
and that is why Galton called this regression to the mean. 


Regression to the mean occurs in many contexts. For example, let X be a baseball player’s average 
for the first half of the season and let Y be the average for the second half. Most of the players with a 
high X (say, above .300) will not have such a high Y. The same kind of reasoning applies to the 
“sophomore jinx,” which says that if a player has a very good first season, then the player is unlikely 
to do as well in the second season. 


The Multivariate Normal Distribution 

The multivariate normal distribution extends the bivariate normal distribution to situations involving 
models for n random variables X₁, X₂, ..., Xₙ, with n > 2. The joint density function is quite complicated;
the only way to express it compactly is to make use of matrix algebra notation, and
probability calculations based on this distribution are extremely complex. Here are some of the most 
important properties of the distribution: 


The distribution of any linear combination of X,, Xo, ..., X, is normal. 

The marginal distribution of any X; is normal. 

The joint distribution of any pair X;, X; is bivariate normal. 

The conditional distribution of any X;, given values of the other n — 1 variables, is normal. 
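The first property is the one used most often in practice: once the mean and variance of a linear combination are known, probabilities follow from the normal cdf. The sketch below assumes a hypothetical mean vector and covariance matrix (the numbers are invented for illustration and do not come from any example in this section), and the helper names are ours.

```python
import math

def linear_combination_params(coeffs, means, cov):
    """Mean and variance of sum(a_i * X_i) given the means and the covariance matrix."""
    k = len(coeffs)
    mean = sum(a * m for a, m in zip(coeffs, means))
    var = sum(coeffs[i] * coeffs[j] * cov[i][j] for i in range(k) for j in range(k))
    return mean, var

def normal_cdf(x, mean, sd):
    """Normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Hypothetical parameters for (X1, X2, X3)
means = [10.0, 20.0, 30.0]
cov = [[4.0, 1.0, 0.5],
       [1.0, 9.0, 2.0],
       [0.5, 2.0, 16.0]]

mean_t, var_t = linear_combination_params([1, 1, 1], means, cov)   # T = X1 + X2 + X3
p_exceed = 1.0 - normal_cdf(66.0, mean_t, math.sqrt(var_t))        # P(T > 66)
print(mean_t, var_t, round(p_exceed, 4))
```

Under multivariate normality T is exactly normal, so `p_exceed` is an exact probability for these assumed parameters rather than an approximation.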


Many procedures for the analysis of multivariate data (observations simultaneously on three or more variables) are based on assuming that the data were selected from a multivariate normal distribution. The book by Rencher and Christensen (see the bibliography) provides more information on multivariate analysis and the multivariate normal distribution.

Exercises: Section 5.5 (91-100)


91. For a few years, the SAT consisted of three components: writing, critical reading, and mathematics. Let W = SAT Writing score and X = SAT Critical Reading score for a randomly selected student. According to the College Board, in 2012 W had mean 488 and standard deviation 114, while X had mean 496 and standard deviation 114. Suppose X and W have a bivariate normal distribution with Corr(X, W) = .5.

a. An English department plans to use X + W, a student's total score on the nonmath sections of the SAT, to help determine admission. Determine the distribution of X + W.

b. Calculate P(X + W > 1200).

c. Suppose the English department wishes to admit only those students who score in the top 10% on this Critical Reading + Writing criterion. What combined score separates the top 10% of students from the rest?


92. Refer to the previous exercise. Let Y = SAT Mathematics score, which had mean 514 and standard deviation 117 in the year 2012. Let T = W + X + Y, a student's grand total score on the three components of the SAT.

a. Find the expected value of T.

b. Assume Corr(W, Y) = .2 and Corr(X, Y) = .25. Find the variance of T. [Hint: Use Expression (5.5) from Section 5.3.]

c. Suppose W, X, Y have a multivariate normal distribution, in which case T is also normally distributed. Determine P(T > 2000).

d. What is the 99th percentile of SAT grand total scores, according to this model?


93. Let X = height (inches) and Y = weight (lbs) for an American male. Suppose X and Y have a bivariate normal distribution, the mean and sd of heights are 70 in and 3 in, the mean and sd of weights are 170 lbs and 20 lbs, and $\rho$ = .9.

a. Determine the distribution of Y given X = 68, i.e., the weight distribution for 5'8" American males.

b. Determine the distribution of Y given X = 70, i.e., the weight distribution for 5'10" American males. In what ways is this distribution similar to that of part (a), and how are they different?

c. Calculate P(Y < 180 | X = 72), the probability that a 6-foot-tall American male weighs less than 180 lb.


94. In electrical engineering, the unwanted "noise" in voltage or current signals is often modeled by a Gaussian (i.e., normal) distribution. Suppose that the noise in a particular voltage signal has a constant mean of 0.9 V, and that two noise instances sampled t seconds apart have a bivariate normal distribution with covariance equal to $0.04e^{-|t|/10}$. Let X and Y denote the noise at times 3 s and 8 s, respectively.

a. Determine Cov(X, Y).

b. Determine $\sigma_X$ and $\sigma_Y$. [Hint: V(X) = Cov(X, X).]

c. Determine Corr(X, Y).

d. Find the probability we observe greater voltage noise at time 3 s than at time 8 s.

e. Find the probability that the voltage noise at time 3 s is more than 1 V above the voltage noise at time 8 s.


95. For a Calculus I class, the final exam score Y and the average X of the four earlier tests have a bivariate normal distribution with mean $\mu_1$ = 73, standard deviation $\sigma_1$ = 12, mean $\mu_2$ = 70, standard deviation $\sigma_2$ = 15. The correlation is $\rho$ = .71. Determine

a. $\mu_{Y|X=x}$

b. $\sigma^2_{Y|X=x}$

c. $\sigma_{Y|X=x}$

d. P(Y > 90 | X = 80), i.e., the probability that the final exam score exceeds 90 given that the average of the four earlier tests is 80.


96. Refer to the previous exercise. Suppose a student's Calculus I grade is determined by 4X + Y, the total score across five tests.

a. Find the mean and standard deviation of 4X + Y.

b. Determine P(4X + Y < 320).

c. Suppose the instructor sets the curve in such a way that the top 15% of students, based on total score across the five tests, will receive As. What point total is required to get an A in Calculus I?


97. Let X and Y, reaction times (sec) to two different stimuli, have a bivariate normal distribution with mean $\mu_1$ = 20 and standard deviation $\sigma_1$ = 2 for X and mean $\mu_2$ = 30 and standard deviation $\sigma_2$ = 5 for Y. Assume $\rho$ = .8. Determine

a. $\mu_{Y|X=x}$

b. $\sigma^2_{Y|X=x}$

c. $\sigma_{Y|X=x}$

d. P(Y > 46 | X = 25)


98. Refer to the previous exercise.

a. One researcher is interested in X + Y, the total reaction time to the two stimuli. Determine the mean and standard deviation of X + Y.

b. If X and Y were independent, what would be the standard deviation of X + Y? Explain why it makes sense that the sd in part (a) is much larger than this.

c. Another researcher is interested in Y − X, the difference in the reaction times to the two stimuli. Determine the mean and standard deviation of Y − X.

d. If X and Y were independent, what would be the standard deviation of Y − X? Explain why it makes sense that the sd in part (c) is much smaller than this.


99. Let X and Y be the times for a randomly selected individual to complete two different tasks, and assume that (X, Y) has a bivariate normal distribution with $\mu_1$ = 100, $\sigma_1$ = 50, $\mu_2$ = 25, $\sigma_2$ = 5, $\rho$ = .5. From statistical software we obtain P(X < 100, Y < 25) = .3333, P(X < 50, Y < 20) = .0625, P(X < 50, Y < 25) = .1274, and P(X < 100, Y < 20) = .1274.

a. Determine P(50 < X < 100, 20 < Y < 25).

b. Leave the other parameters the same but change the correlation to $\rho$ = 0 (independence). Now re-compute the probability in part (a). Intuitively, why should the original be larger?


100. One of the propositions of this section gives an expression for E(Y | X = x).

a. By reversing the roles of X and Y give a similar formula for E(X | Y = y).

b. Both E(Y | X = x) and E(X | Y = y) are linear functions. Show that the product of the two slopes is $\rho^2$.


5.6 Transformations of Multiple Random Variables 


In Chapter 4 we discussed the problem of starting with a single random variable X, forming some function of X, such as $Y = X^2$ or $Y = e^X$, and investigating the distribution of this new random variable Y. We now generalize this scenario by starting with more than a single random variable. Consider as an example a system having a component that can be replaced just once before the system itself expires. Let $X_1$ denote the lifetime of the original component and $X_2$ the lifetime of the replacement component. Then any of the following functions of $X_1$ and $X_2$ may be of interest to an investigator:

1. The total lifetime, $X_1 + X_2$.

2. The ratio of lifetimes $X_1/X_2$ (for example, if the value of this ratio is 2, the original component lasted twice as long as its replacement).

3. The ratio $X_1/(X_1 + X_2)$, which represents the proportion of system lifetime during which the original component operated.


The Joint Distribution of Two New Random Variables 

Given two random variables $X_1$ and $X_2$, consider forming two new random variables $Y_1 = u_1(X_1, X_2)$ and $Y_2 = u_2(X_1, X_2)$. (Since most applications assume that the $X_i$'s are continuous, we restrict ourselves to that case.) Our focus is on finding the joint distribution of these two new variables. The $u_1(\cdot)$ and $u_2(\cdot)$ functions express the new variables in terms of the original ones. The upcoming general result presumes that these functions can be inverted to solve for the original variables in terms of the new ones:

$$X_1 = v_1(Y_1, Y_2), \qquad X_2 = v_2(Y_1, Y_2)$$

For example, if

$$y_1 = x_1 + x_2 \quad \text{and} \quad y_2 = \frac{x_1}{x_1 + x_2}$$

then multiplying $y_2$ by $y_1$ gives an expression for $x_1$, and then we can substitute this into the expression for $y_1$ and solve for $x_2$:

$$x_1 = y_1 y_2 = v_1(y_1, y_2), \qquad x_2 = y_1(1 - y_2) = v_2(y_1, y_2)$$

Finally, let $f(x_1, x_2)$ denote the joint pdf of the two original variables, let $g(y_1, y_2)$ denote the joint pdf of the two new variables, and define two sets S and T by

$$S = \{(x_1, x_2) : f(x_1, x_2) > 0\}, \qquad T = \{(y_1, y_2) : g(y_1, y_2) > 0\}$$

That is, S is the region of positive density for the original variables and T is the region of positive density for the new variables; T is the "image" of S under the transformation.


TRANSFORMATION THEOREM (bivariate case)

Suppose that the partial derivative of each $v_i(y_1, y_2)$ with respect to both $y_1$ and $y_2$ exists and is continuous for every $(y_1, y_2) \in T$. Form the 2 × 2 matrix

$$M = \begin{pmatrix} \dfrac{\partial v_1(y_1, y_2)}{\partial y_1} & \dfrac{\partial v_1(y_1, y_2)}{\partial y_2} \\[2ex] \dfrac{\partial v_2(y_1, y_2)}{\partial y_1} & \dfrac{\partial v_2(y_1, y_2)}{\partial y_2} \end{pmatrix}$$

The determinant of this matrix, called the Jacobian, is

$$\det(M) = \frac{\partial v_1}{\partial y_1} \cdot \frac{\partial v_2}{\partial y_2} - \frac{\partial v_1}{\partial y_2} \cdot \frac{\partial v_2}{\partial y_1}$$

The joint pdf for the new variables then results from taking the joint pdf $f(x_1, x_2)$ for the original variables, replacing $x_1$ and $x_2$ by their expressions in terms of $y_1$ and $y_2$, and finally multiplying this by the absolute value of the Jacobian:

$$g(y_1, y_2) = f\big(v_1(y_1, y_2), v_2(y_1, y_2)\big) \cdot |\det(M)| \qquad (y_1, y_2) \in T$$


The theorem can be rewritten slightly by using the notation

$$\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \det(M)$$

Then we have

$$g(y_1, y_2) = f(x_1, x_2) \cdot \left| \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} \right|$$

which is the natural extension of the univariate Transformation Theorem $f_Y(y) = f_X(x) \cdot |dx/dy|$ discussed in Chapter 4.
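A Jacobian computed by hand can be sanity-checked numerically. The sketch below approximates the four partial derivatives by central differences (the helper `jacobian_det` and the step size `h` are our own choices) and applies it to the inverse transformation $x_1 = y_1 y_2$, $x_2 = y_1(1 - y_2)$ used as an illustration earlier in this section.

```python
def jacobian_det(v1, v2, y1, y2, h=1e-6):
    """Approximate det(M) for x1 = v1(y1, y2), x2 = v2(y1, y2) by central differences."""
    d11 = (v1(y1 + h, y2) - v1(y1 - h, y2)) / (2 * h)   # dv1/dy1
    d12 = (v1(y1, y2 + h) - v1(y1, y2 - h)) / (2 * h)   # dv1/dy2
    d21 = (v2(y1 + h, y2) - v2(y1 - h, y2)) / (2 * h)   # dv2/dy1
    d22 = (v2(y1, y2 + h) - v2(y1, y2 - h)) / (2 * h)   # dv2/dy2
    return d11 * d22 - d12 * d21

def v1(y1, y2):
    return y1 * y2            # x1 in terms of the new variables

def v2(y1, y2):
    return y1 * (1.0 - y2)    # x2 in terms of the new variables

det_at_point = jacobian_det(v1, v2, 2.0, 0.3)
print(det_at_point)  # close to -2, i.e. -y1 at y1 = 2
```

For this transformation the determinant works out algebraically to $-y_1$, so the numerical value at $(y_1, y_2) = (2, .3)$ should be near −2; the pdf formula then uses its absolute value.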


Example 5.36 Continuing with the component lifetime situation, suppose that $X_1$ and $X_2$ are independent, each having an exponential distribution with parameter $\lambda$. Let's determine the joint pdf of

$$Y_1 = u_1(X_1, X_2) = X_1 + X_2 \quad \text{and} \quad Y_2 = u_2(X_1, X_2) = \frac{X_1}{X_1 + X_2}$$

We have already inverted this transformation:

$$x_1 = v_1(y_1, y_2) = y_1 y_2, \qquad x_2 = v_2(y_1, y_2) = y_1(1 - y_2)$$

The image of the transformation, i.e., the set of $(y_1, y_2)$ pairs with positive density, is $y_1 > 0$ and $0 < y_2 < 1$. The four relevant partial derivatives are

$$\frac{\partial v_1}{\partial y_1} = y_2, \qquad \frac{\partial v_1}{\partial y_2} = y_1, \qquad \frac{\partial v_2}{\partial y_1} = 1 - y_2, \qquad \frac{\partial v_2}{\partial y_2} = -y_1$$

from which the Jacobian is $\det(M) = -y_1 y_2 - y_1(1 - y_2) = -y_1$.
Since the joint pdf of $X_1$ and $X_2$ is

$$f(x_1, x_2) = \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2} = \lambda^2 e^{-\lambda(x_1 + x_2)} \qquad x_1 > 0,\ x_2 > 0$$

we have, by the Transformation Theorem,

$$g(y_1, y_2) = \lambda^2 e^{-\lambda y_1} \cdot y_1 = \lambda^2 y_1 e^{-\lambda y_1} \cdot 1 \qquad y_1 > 0,\ 0 < y_2 < 1$$

In the last step, we've factored the joint pdf into two parts: the first part is a gamma pdf with parameters $\alpha = 2$ and $\beta = 1/\lambda$, and the second part is a uniform pdf on (0, 1). Since the pdf factors and the region of positive density is rectangular, we have discovered that

• The distribution of system lifetime $X_1 + X_2$ is gamma (with $\alpha = 2$, $\beta = 1/\lambda$);

• The distribution of the proportion of system lifetime during which the original component functions is uniform on (0, 1); and

• $Y_1 = X_1 + X_2$ and $Y_2 = X_1/(X_1 + X_2)$ are independent of each other.
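The three conclusions of Example 5.36 can be spot-checked by simulation. The following is a rough sketch, not a proof: the rate `lam = 2.0`, the seed, and the sample size are arbitrary choices of ours, and agreement is only expected up to simulation error.

```python
import random

random.seed(1)
lam = 2.0        # arbitrary rate for the illustration
n = 100_000

y1, y2 = [], []
for _ in range(n):
    x1 = random.expovariate(lam)   # lifetime of the original component
    x2 = random.expovariate(lam)   # lifetime of the replacement
    y1.append(x1 + x2)             # total lifetime
    y2.append(x1 / (x1 + x2))      # proportion due to the original component

mean_y1 = sum(y1) / n              # gamma(alpha=2, beta=1/lam) has mean 2/lam
mean_y2 = sum(y2) / n              # uniform(0, 1) has mean 1/2
var_y1 = sum((a - mean_y1) ** 2 for a in y1) / n
var_y2 = sum((b - mean_y2) ** 2 for b in y2) / n   # uniform(0, 1) variance is 1/12
cov = sum((a - mean_y1) * (b - mean_y2) for a, b in zip(y1, y2)) / n
corr = cov / (var_y1 * var_y2) ** 0.5              # should be near 0

print(mean_y1, mean_y2, var_y2, corr)
```

A sample correlation near zero is consistent with independence but does not establish it; the factorization of the joint pdf is what actually proves the result.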


In the foregoing example, because the joint pdf factored into one pdf involving $y_1$ alone and another pdf involving $y_2$ alone, the individual (i.e., marginal) pdfs of the two new variables were obtained from the joint pdf without any further effort. Often this will not be the case; that is, $Y_1$ and $Y_2$ will not be independent. Then to obtain the marginal pdf of $Y_1$, the joint pdf must be integrated over all values of the second variable.

In fact, in many applications an investigator wishes to obtain the distribution of a single function $Y_1 = u_1(X_1, X_2)$ of the original variables. To accomplish this, a second function $Y_2 = u_2(X_1, X_2)$ is created, the joint pdf is obtained, and then $y_2$ is integrated out. There are of course many ways to select the second function. The choice should be made so that the transformation can be easily inverted and the subsequent integration is straightforward.


Example 5.37 Consider a rectangular coordinate system with a horizontal $x_1$-axis and a vertical $x_2$-axis as shown in Figure 5.8a.

Figure 5.8 Regions of positive density for Example 5.37: (a) the unit square in the $(x_1, x_2)$ plane, with a possible rectangle; (b) the triangular region in the $(y_1, y_2)$ plane

First a point $(X_1, X_2)$ is randomly selected, where the joint pdf of $X_1$, $X_2$ is

$$f(x_1, x_2) = x_1 + x_2 \qquad 0 < x_1 < 1,\ 0 < x_2 < 1$$

Then a rectangle with vertices (0, 0), $(X_1, 0)$, $(0, X_2)$, and $(X_1, X_2)$ is formed as shown in Figure 5.8a. What is the distribution of $X_1 X_2$, the area of this rectangle? Define

$$Y_1 = u_1(X_1, X_2) = X_1 X_2 \quad \text{and} \quad Y_2 = u_2(X_1, X_2) = X_2$$

The inverse of this transformation is easily obtained:

$$x_1 = v_1(y_1, y_2) = \frac{y_1}{y_2} \quad \text{and} \quad x_2 = v_2(y_1, y_2) = y_2$$

Notice that because $x_2$ ($= y_2$) is between 0 and 1 and $y_1$ is the product of the two $x_i$'s, it must be the case that $0 < y_1 < y_2$. The region of positive density for the new variables is then $T = \{(y_1, y_2) : 0 < y_1 < y_2,\ 0 < y_2 < 1\}$, the triangular region shown in Figure 5.8b.

Since $\partial v_2/\partial y_1 = 0$, the product of the two off-diagonal elements in the matrix M will be 0, so only the two diagonal elements contribute to the Jacobian:

$$M = \begin{pmatrix} 1/y_2 & -y_1/y_2^2 \\ 0 & 1 \end{pmatrix} \quad \Rightarrow \quad \det(M) = \frac{1}{y_2}$$

The joint pdf of the two new variables is now

$$g(y_1, y_2) = f\!\left(\frac{y_1}{y_2}, y_2\right) \cdot |\det(M)| = \left(\frac{y_1}{y_2} + y_2\right) \cdot \frac{1}{y_2} \qquad 0 < y_1 < y_2 < 1$$

Finally, to obtain the marginal pdf of $Y_1$ alone, we must now fix $y_1$ at some arbitrary value between 0 and 1, and integrate out $y_2$. Figure 5.8b shows that for any value of $y_1$, the values of $y_2$ range from $y_1$ to 1:

$$g_1(y_1) = \int_{y_1}^{1} \left(\frac{y_1}{y_2} + y_2\right) \cdot \frac{1}{y_2}\, dy_2 = 2(1 - y_1) \qquad 0 < y_1 < 1$$

This marginal pdf can now be integrated to obtain any desired probability involving the area. For example, integrating from 0 to .5 gives $P(Y_1 \le .5) = .75$.
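The "integrate out $y_2$" step of Example 5.37 can be mimicked numerically. The sketch below uses a plain midpoint rule (the function names and step counts are our own choices) to recover the marginal pdf from the joint pdf and then the probability that the area is at most .5.

```python
def joint_pdf(y1, y2):
    """g(y1, y2) = (y1/y2 + y2) * (1/y2) on 0 < y1 < y2 < 1, and 0 elsewhere."""
    if 0.0 < y1 < y2 < 1.0:
        return (y1 / y2 + y2) / y2
    return 0.0

def marginal_pdf(y1, steps=20_000):
    """Integrate the joint pdf over y2 from y1 to 1 using the midpoint rule."""
    h = (1.0 - y1) / steps
    return h * sum(joint_pdf(y1, y1 + (k + 0.5) * h) for k in range(steps))

def prob_area_at_most(a, steps=400):
    """P(Y1 <= a), integrating the numerical marginal over (0, a)."""
    h = a / steps
    return h * sum(marginal_pdf((k + 0.5) * h, steps=2_000) for k in range(steps))

print(marginal_pdf(0.25))       # compare with 2(1 - .25) = 1.5
print(prob_area_at_most(0.5))   # compare with .75
```

The numerical values agree with the closed forms only approximately; the integrand is steep near $y_2 = y_1$ when $y_1$ is small, which is why a fairly fine grid is used.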


The Joint Distribution of More Than Two New Variables 

Consider now starting with three random variables $X_1$, $X_2$, and $X_3$, and forming three new variables $Y_1$, $Y_2$, and $Y_3$. Suppose again that the transformation can be inverted to express the original variables in terms of the new ones:

$$x_1 = v_1(y_1, y_2, y_3), \qquad x_2 = v_2(y_1, y_2, y_3), \qquad x_3 = v_3(y_1, y_2, y_3)$$

Then the foregoing theorem can be extended to this new situation. The Jacobian matrix has dimension 3 × 3, with the entry in the ith row and jth column being $\partial v_i/\partial y_j$. The joint pdf of the new variables results from replacing each $x_i$ in the original pdf $f(\cdot)$ by its expression in terms of the $y_j$'s and multiplying by the absolute value of the Jacobian.


Example 5.38 Consider n = 3 identical components with independent lifetimes $X_1$, $X_2$, $X_3$, each having an exponential distribution with parameter $\lambda$. If the first component is used until it fails, replaced by the second one which remains in service until it fails, and finally the third component is used until failure, then the total lifetime of these components is $Y_3 = X_1 + X_2 + X_3$. (This design structure, where one component is replaced by the next in succession, is called a standby system.) To find the distribution of total lifetime, let's first define two other new variables: $Y_1 = X_1$ and $Y_2 = X_1 + X_2$ (so that $Y_1 < Y_2 < Y_3$). After finding the joint pdf of all three variables, we integrate out the first two variables to obtain the desired information. Solving for the old variables in terms of the new gives

$$x_1 = y_1, \qquad x_2 = y_2 - y_1, \qquad x_3 = y_3 - y_2$$

It is obvious by inspection of these expressions that the three diagonal elements of the Jacobian matrix are all 1s and that the elements above the diagonal are all 0s, so the determinant is 1, the product of the diagonal elements. Since

$$f(x_1, x_2, x_3) = \lambda^3 e^{-\lambda(x_1 + x_2 + x_3)} \qquad x_1 > 0,\ x_2 > 0,\ x_3 > 0$$

by substitution,

$$g(y_1, y_2, y_3) = \lambda^3 e^{-\lambda y_3} \qquad 0 < y_1 < y_2 < y_3$$

Integrating this joint pdf first with respect to $y_1$ between 0 and $y_2$ and then with respect to $y_2$ between 0 and $y_3$ (try it!) gives

$$g_3(y_3) = \frac{\lambda^3}{2}\, y_3^2\, e^{-\lambda y_3} \qquad y_3 > 0$$

which is the gamma pdf with $\alpha = 3$ and $\beta = 1/\lambda$. This result is a special case of the last proposition from Section 5.3, stating that the sum of n iid exponential rvs has a gamma distribution with $\alpha = n$.
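As a quick plausibility check on Example 5.38, the simulated total lifetime of a three-component standby system can be compared with the gamma cdf obtained by integrating $g_3$. In this sketch the rate `lam = 0.5`, the cutoff `y = 6.0`, the seed, and the sample size are all arbitrary illustrative choices.

```python
import math
import random

def gamma3_cdf(y, lam):
    """cdf of the gamma distribution with alpha = 3, beta = 1/lam,
    i.e. the integral of (lam**3 / 2) * t**2 * exp(-lam * t) from 0 to y."""
    t = lam * y
    return 1.0 - math.exp(-t) * (1.0 + t + t * t / 2.0)

random.seed(7)
lam = 0.5
n = 100_000

# Total lifetime of a standby system with three iid exponential components
totals = [sum(random.expovariate(lam) for _ in range(3)) for _ in range(n)]

y = 6.0
empirical = sum(t <= y for t in totals) / n
print(empirical, gamma3_cdf(y, lam))
```

The two printed numbers should agree to within ordinary simulation error (a few thousandths at this sample size).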


Exercises: Section 5.6 (101-108)


101. Let $X_1$ and $X_2$ be independent, standard normal rvs.

a. Define $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. Determine the joint pdf of $Y_1$ and $Y_2$.

b. Determine the marginal pdf of $Y_1$. [Note: We know the sum of two independent normal rvs is normal, so you can check your answer against the appropriate normal pdf.]

c. Are $Y_1$ and $Y_2$ independent?


102. Consider two components whose lifetimes $X_1$ and $X_2$ are independent and exponentially distributed with parameters $\lambda_1$ and $\lambda_2$, respectively. Obtain the joint pdf of total lifetime $X_1 + X_2$ and the proportion of total lifetime $X_1/(X_1 + X_2)$ during which the first component operates.


103. Let $X_1$ denote the time (hr) it takes to perform a first task and $X_2$ denote the time it takes to perform a second one. The second task always takes at least as long to perform as the first task. The joint pdf of these variables is

$$f(x_1, x_2) = 2(x_1 + x_2) \qquad 0 \le x_1 \le x_2 \le 1$$

a. Obtain the pdf of the total completion time for the two tasks.

b. Obtain the pdf of the difference $X_2 - X_1$ between the longer completion time and the shorter time.


104. An exam consists of a problem section and a short-answer section. Let $X_1$ denote the amount of time (hr) that a student spends on the problem section and $X_2$ represent the amount of time the same student spends on the short-answer section. Suppose the joint pdf of these two times is

$$f(x_1, x_2) = c\,x_1 x_2 \qquad x_1/3 < x_2 < x_1/2,\ 0 < x_1 < 1$$

a. What is the value of c?

b. If the student spends exactly .25 h on the short-answer section, what is the probability that at most .60 h was spent on the problem section? [Hint: First obtain the relevant conditional distribution.]

c. What is the probability that the amount of time spent on the problem part of the exam exceeds the amount of time spent on the short-answer part by at least .5 h?

d. Obtain the joint distribution of $Y_1 = X_2/X_1$, the ratio of the two times, and $Y_2 = X_2$. Then obtain the marginal distribution of the ratio.


105. Consider randomly selecting a point $(X_1, X_2, X_3)$ in the unit cube according to the joint pdf

$$f(x_1, x_2, x_3) = 8 x_1 x_2 x_3 \qquad 0 < x_1 < 1,\ 0 < x_2 < 1,\ 0 < x_3 < 1$$

Then form a rectangular solid whose vertices are (0, 0, 0), $(X_1, 0, 0)$, $(0, X_2, 0)$, $(X_1, X_2, 0)$, $(0, 0, X_3)$, $(X_1, 0, X_3)$, $(0, X_2, X_3)$, and $(X_1, X_2, X_3)$. The volume of this solid is $Y_3 = X_1 X_2 X_3$. Obtain the pdf of $Y_3$. [Hint: Let $Y_1 = X_1$ and $Y_2 = X_1 X_2$.]


106. Let $X_1$ and $X_2$ be independent, each having a standard normal distribution. The pair $(X_1, X_2)$ corresponds to a point in a two-dimensional coordinate system. Consider now changing to polar coordinates via the transformation

$$Y_1 = X_1^2 + X_2^2$$

$$Y_2 = \begin{cases} \arctan(X_2/X_1) & X_1 > 0,\ X_2 \ge 0 \\ \arctan(X_2/X_1) + 2\pi & X_1 > 0,\ X_2 < 0 \\ \arctan(X_2/X_1) + \pi & X_1 < 0 \\ 0 & X_1 = 0 \end{cases}$$

from which $X_1 = \sqrt{Y_1}\cos(Y_2)$, $X_2 = \sqrt{Y_1}\sin(Y_2)$. Obtain the joint pdf of the new variables and then the marginal distribution of each one. [Note: It would be preferable to let $Y_2 = \arctan(X_2/X_1)$, but in order to ensure invertibility of the arctan function, it is defined to take on values only between $-\pi/2$ and $\pi/2$. Our specification of $Y_2$ allows it to assume any value between 0 and $2\pi$.]


107. The result of the previous exercise suggests how observed values of two independent standard normal variables can be generated by first generating their polar coordinates with an exponential rv with $\lambda = 1/2$ and an independent Unif(0, $2\pi$) rv: Let $U_1$ and $U_2$ be independent Unif(0, 1) rvs, and then let

$$Y_1 = -2\ln(U_1) \qquad Y_2 = 2\pi U_2$$

$$Z_1 = \sqrt{Y_1}\cos(Y_2) \qquad Z_2 = \sqrt{Y_1}\sin(Y_2)$$

Show that the $Z_i$'s are independent standard normal. [Note: This is called the Box-Muller transformation after the two individuals who discovered it. Now that statistical software packages will generate almost instantaneously observations from a normal distribution with any mean and variance, it is thankfully no longer necessary for people like you and us to carry out the transformations just described; let the software do it!]


108. Let $X_1$ and $X_2$ be independent random variables, each having a standard normal distribution. Show that the pdf of the ratio $Y = X_1/X_2$ is given by $f(y) = 1/[\pi(1 + y^2)]$ for $-\infty < y < \infty$. (This is called the standard Cauchy distribution; its density curve is bell-shaped, but the tails are so heavy that $\mu$ does not exist.)


5.7 Order Statistics


Many statistical procedures involve ordering the sample observations from smallest to largest and then manipulating these ordered values in various ways. For example, the sample median is either the middle value in the ordered list or the average of the two middle values depending on whether the sample size n is odd or even. The sample range is the difference between the largest and smallest values. And a trimmed mean results from deleting the same number of observations from each end of the ordered list and averaging the remaining values.

Throughout this section, we assume that we have a collection of rvs $X_1, X_2, \ldots, X_n$ with the following properties:

1. The $X_i$'s are independent rvs.

2. Every $X_i$ has the same probability distribution (e.g., they all follow an exponential distribution with the same parameter $\lambda$).

3. The distribution shared by the $X_i$'s is continuous, with cumulative distribution function F(x) and density function f(x).


Assumptions 1 and 2 can be paraphrased by saying that the $X_i$'s are a random sample from the specified distribution. The continuity assumption in 3 implies that $P(X_i = X_j) = 0$ for $i \ne j$; thus, with probability 1, the n sample observations will all be distinct (no ties). Of course, in practice all measuring instruments have accuracy limitations, so tied values may in fact result.


DEFINITION The order statistics from a random sample are the random variables $Y_1, \ldots, Y_n$ given by

$Y_1$ = the smallest among $X_1, X_2, \ldots, X_n$ (i.e., the sample minimum)
$Y_2$ = the second smallest among $X_1, X_2, \ldots, X_n$
⋮
$Y_n$ = the largest among $X_1, X_2, \ldots, X_n$ (the sample maximum)

Thus, with probability 1, $Y_1 < Y_2 < \cdots < Y_{n-1} < Y_n$.


The sample median is then $Y_{(n+1)/2}$ when n is odd, the sample range is $Y_n - Y_1$, and for n = 10 the 20% trimmed mean is $\sum_{i=3}^{8} Y_i/6$. The order statistics are defined as random variables (hence the use of uppercase letters); observed values are denoted by $y_1, \ldots, y_n$.


The Distributions of $Y_n$ and $Y_1$

The key idea in obtaining the distribution of the sample maximum $Y_n$ is the observation that $Y_n$ is at most y if and only if every one of the $X_i$'s is at most y. Similarly, the distribution of $Y_1$ is based on the fact that it will exceed y if and only if all $X_i$'s exceed y.


Example 5.39 Consider 5 identical components connected in parallel, as illustrated in Figure 5.9a.


Figure 5.9 Systems of components for Example 5.39: (a) parallel connection; (b) series connection

Let $X_i$ denote the lifetime, in hours, of the ith component (i = 1, 2, 3, 4, 5). Suppose that the $X_i$'s are independent and that each has an exponential distribution with $\lambda = .01$, so the expected lifetime of any particular component is $1/\lambda = 100$ h. Because of the parallel configuration, the system will continue to function as long as at least one component is still working, and will fail as soon as the last component functioning ceases to do so. That is, the system lifetime is $Y_5$, the largest order statistic in a sample of size 5 from the specified exponential distribution. Now $Y_5$ will be at most y if and only if every one of the five $X_i$'s is at most y. With $G_5(y)$ denoting the cumulative distribution function of $Y_5$,

$$\begin{aligned} G_5(y) = P(Y_5 \le y) &= P(X_1 \le y \cap X_2 \le y \cap \cdots \cap X_5 \le y) \\ &= P(X_1 \le y) \cdot P(X_2 \le y) \cdot \cdots \cdot P(X_5 \le y) \\ &= [F(y)]^5 = [1 - e^{-.01y}]^5 \end{aligned}$$

The pdf of $Y_5$ can now be obtained by differentiating the cdf with respect to y.

Suppose instead that the five components are connected in series rather than in parallel (Figure 5.9b). In this case the system lifetime will be $Y_1$, the smallest of the five order statistics, since the system will crash as soon as a single one of the individual components fails. Note that system lifetime will exceed y hours if and only if the lifetime of every component exceeds y hours. Thus

$$\begin{aligned} G_1(y) = P(Y_1 \le y) &= 1 - P(Y_1 > y) \\ &= 1 - P(X_1 > y \cap X_2 > y \cap \cdots \cap X_5 > y) \\ &= 1 - P(X_1 > y) \cdot P(X_2 > y) \cdot \cdots \cdot P(X_5 > y) \\ &= 1 - [e^{-.01y}]^5 = 1 - e^{-.05y} \end{aligned}$$

This is the form of an exponential cdf with parameter .05. More generally, if the n components in a series connection have lifetimes that are independent, each exponentially distributed with the same parameter $\lambda$, then system lifetime will be exponentially distributed with parameter $n\lambda$. The expected system lifetime will then be $1/(n\lambda)$, much smaller than the expected lifetime of an individual component.

An argument parallel to that of the previous example for a general sample size n and an arbitrary 
pdf f(x) gives the following general results. 


PROPOSITION Let $Y_1$ and $Y_n$ denote the smallest and largest order statistics, respectively, based on a random sample from a continuous distribution with cdf F(x) and pdf f(x). Then the cdf and pdf of $Y_n$ are

$$G_n(y) = [F(y)]^n \qquad g_n(y) = n[F(y)]^{n-1} \cdot f(y)$$

The cdf and pdf of $Y_1$ are

$$G_1(y) = 1 - [1 - F(y)]^n \qquad g_1(y) = n[1 - F(y)]^{n-1} \cdot f(y)$$
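The two cdf formulas translate directly into code. The sketch below (helper names `cdf_max` and `cdf_min` are ours) applies them to the setting of Example 5.39, five components with exponential lifetimes and $\lambda = .01$, and evaluates the chance that each kind of system survives past 100 h.

```python
import math

def cdf_max(y, n, cdf):
    """G_n(y) = [F(y)]^n, the cdf of the largest of n iid observations."""
    return cdf(y) ** n

def cdf_min(y, n, cdf):
    """G_1(y) = 1 - [1 - F(y)]^n, the cdf of the smallest of n iid observations."""
    return 1.0 - (1.0 - cdf(y)) ** n

lam = 0.01   # rate from Example 5.39

def exp_cdf(y):
    """Exponential cdf with rate lam."""
    return 1.0 - math.exp(-lam * y) if y > 0 else 0.0

parallel_survives = 1.0 - cdf_max(100.0, 5, exp_cdf)   # P(Y5 > 100)
series_survives = 1.0 - cdf_min(100.0, 5, exp_cdf)     # P(Y1 > 100)
print(parallel_survives, series_survives)
```

The series probability equals $e^{-5}$, matching the exponential distribution with parameter $5\lambda = .05$ found in the example, while the parallel system is far more likely to last beyond 100 h.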


Example 5.40 Let X denote the contents of a one-gallon container, and suppose that its pdf is f(x) = 2x for 0 ≤ x ≤ 1 (and 0 otherwise) with corresponding cdf $F(x) = x^2$ on [0, 1]. Consider a random sample of four such containers. The order statistics $Y_1$ and $Y_4$ represent the contents of the least-filled container and the most-filled container, respectively. The pdfs of $Y_1$ and $Y_4$ are

$$g_1(y) = 4(1 - y^2)^3 \cdot 2y = 8y(1 - y^2)^3 \qquad 0 \le y \le 1$$

$$g_4(y) = 4(y^2)^3 \cdot 2y = 8y^7 \qquad 0 \le y \le 1$$

The corresponding density curves appear in Figure 5.10.

Figure 5.10 Density curves for the order statistics (a) $Y_1$ and (b) $Y_4$ in Example 5.40

Let's determine the expected value of $Y_4 - Y_1$, the difference between the contents of the most-filled container and the least-filled container; $Y_4 - Y_1$ is just the sample range. Apply linearity of expectation:

$$E(Y_4 - Y_1) = E(Y_4) - E(Y_1) = \int_0^1 y \cdot 8y^7\, dy - \int_0^1 y \cdot 8y(1 - y^2)^3\, dy = \frac{8}{9} - \frac{384}{945} = .889 - .406 = .483$$

If random samples of four containers were repeatedly selected and the sample range of contents determined for each one, the long-run average value of the range would be .483 gallons.
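Because both integrands in Example 5.40 are polynomials, the expected range can be reproduced exactly with rational arithmetic. The sketch below expands $(1 - y^2)^3$ by hand and integrates term by term; the variable names are our own.

```python
from fractions import Fraction

# E(Y4) = integral over [0, 1] of y * 8y^7 = 8/9
e_max = Fraction(8, 9)

# E(Y1) = integral over [0, 1] of 8 y^2 (1 - y^2)^3.
# Expand (1 - y^2)^3 = 1 - 3y^2 + 3y^4 - y^6 as {power: coefficient};
# each term y^2 * y^p integrates to 1/(p + 3) on [0, 1].
coeffs = {0: 1, 2: -3, 4: 3, 6: -1}
e_min = 8 * sum(Fraction(c, p + 3) for p, c in coeffs.items())

expected_range = e_max - e_min
print(e_min, expected_range, round(float(expected_range), 3))
```

The exact value of E($Y_1$) reduces to 128/315, which is the same number as 384/945, and the expected range rounds to .483.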


The Distribution of the ith Order Statistic 

We have already obtained the (marginal) distribution of the largest order statistic $Y_n$ and also that of the smallest order statistic $Y_1$. A generalization of the argument used previously results in the following proposition; the method of derivation is suggested in Exercise 114.


PROPOSITION Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a continuous distribution with cdf F(x) and pdf f(x). The pdf of the ith smallest order statistic $Y_i$ is

$$g_i(y) = \frac{n!}{(i-1)!\,(n-i)!}\,[F(y)]^{i-1}\,[1 - F(y)]^{n-i} f(y) \qquad (5.8)$$

An intuitive justification for Expression (5.8) will be given shortly. Notice that it is consistent with the pdf expressions for $g_1(y)$ and $g_n(y)$ given previously; just substitute i = 1 and i = n, respectively.


Example 5.41 Suppose that component lifetime is exponentially distributed with parameter $\lambda$. For a random sample of n = 5 components, the expected value of the sample median lifetime is

$$E(Y_3) = \int_0^\infty y \cdot g_3(y)\, dy = \int_0^\infty y \cdot \frac{5!}{2! \cdot 2!}\,(1 - e^{-\lambda y})^2 (e^{-\lambda y})^2 \cdot \lambda e^{-\lambda y}\, dy$$

Expanding out the integrand and integrating term by term, the expected value is $.783/\lambda$. The median of the original exponential distribution is, from solving $F(\tilde{\mu}) = .5$, $\tilde{\mu} = -\ln(.5)/\lambda = .693/\lambda$. Thus if sample after sample of five components is selected, the long-run average value of the sample median $Y_3$ will be somewhat larger than the median value of the individual lifetime distribution. This is because the exponential distribution has a positive skew.
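The value $.783/\lambda$ can be confirmed without expanding the integrand by integrating $y \cdot g_3(y)$ numerically. This sketch codes Expression (5.8) for a general cdf and pdf, then applies a midpoint rule with $\lambda = 1$; the function names, the truncation point 60, and the step count are our own choices.

```python
import math

def order_stat_pdf(y, i, n, pdf, cdf):
    """Expression (5.8): pdf of the ith smallest of n iid observations."""
    const = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    F = cdf(y)
    return const * F ** (i - 1) * (1.0 - F) ** (n - i) * pdf(y)

lam = 1.0   # with lam = 1 the answer is in units of 1/lambda

def exp_pdf(y):
    return lam * math.exp(-lam * y)

def exp_cdf(y):
    return 1.0 - math.exp(-lam * y)

def expected_order_stat(i, n, upper=60.0, steps=60_000):
    """E(Y_i) by the midpoint rule on (0, upper); the tail beyond upper is negligible here."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += y * order_stat_pdf(y, i, n, exp_pdf, exp_cdf)
    return h * total

mean_of_median = expected_order_stat(3, 5)
population_median = math.log(2.0) / lam
print(round(mean_of_median, 3), round(population_median, 3))
```

The first number is the long-run average of the sample median and the second is the population median, so their difference shows the upward shift produced by the positive skew.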


There's an intuitive "derivation" of Expression (5.8), the general order statistic pdf. Let $\Delta$ be a number quite close to 0, and consider the three intervals $(-\infty, y]$, $(y, y + \Delta]$, and $(y + \Delta, \infty)$. For a single X, the probabilities of these three intervals are

$$p_1 = P(X \le y) = F(y), \qquad p_2 = P(y < X \le y + \Delta) = \int_y^{y+\Delta} f(x)\, dx \approx f(y) \cdot \Delta, \qquad p_3 = P(X > y + \Delta) = 1 - F(y + \Delta)$$

For a random sample of size n, it is very unlikely that two or more X's will fall in the middle interval, since its width is only $\Delta$. The probability that the ith order statistic falls in the middle interval is then approximately the probability that i − 1 of the X's are in the first interval, one is in the middle, and the remaining n − i are in the third. This is just a multinomial probability:

$$P(y < Y_i \le y + \Delta) \approx \frac{n!}{(i-1)!\,1!\,(n-i)!}\,[F(y)]^{i-1} \cdot f(y) \cdot \Delta \cdot [1 - F(y + \Delta)]^{n-i}$$

Dividing both sides by $\Delta$ and taking the limit as $\Delta \to 0$ gives exactly Expression (5.8). That is, we may interpret the pdf $g_i(y)$ as loosely specifying that i − 1 of the original observations are below y, one is "at" y, and the other n − i are above y.


The Joint Distribution of All n Order Statistics 


We now develop the joint pdf of $Y_1, Y_2, \ldots, Y_n$. Consider first a random sample $X_1$, $X_2$, $X_3$ of fuel efficiency measurements (mpg). The joint pdf of this random sample is

$$f(x_1, x_2, x_3) = f(x_1) \cdot f(x_2) \cdot f(x_3)$$

The joint pdf of $Y_1$, $Y_2$, $Y_3$ will be positive only for values of $y_1$, $y_2$, $y_3$ satisfying $y_1 < y_2 < y_3$. What is this joint pdf at the values $y_1 = 28.4$, $y_2 = 29.0$, $y_3 = 30.5$? There are six different ways to obtain these ordered values:

$X_1 = 28.4$  $X_2 = 29.0$  $X_3 = 30.5$        $X_1 = 28.4$  $X_2 = 30.5$  $X_3 = 29.0$
$X_1 = 29.0$  $X_2 = 28.4$  $X_3 = 30.5$        $X_1 = 29.0$  $X_2 = 30.5$  $X_3 = 28.4$
$X_1 = 30.5$  $X_2 = 28.4$  $X_3 = 29.0$        $X_1 = 30.5$  $X_2 = 29.0$  $X_3 = 28.4$

These six possibilities come from the 3! ways to order the three numerical observations once their values are fixed. Thus

$$\begin{aligned} g(28.4, 29.0, 30.5) &= f(28.4) \cdot f(29.0) \cdot f(30.5) + \cdots + f(30.5) \cdot f(29.0) \cdot f(28.4) \\ &= 3!\, f(28.4) \cdot f(29.0) \cdot f(30.5) \end{aligned}$$

Analogous reasoning with a sample of size n yields the following result:


PROPOSITION Let $g(y_1, y_2, \ldots, y_n)$ denote the joint pdf of the order statistics $Y_1, Y_2, \ldots, Y_n$ resulting from a random sample of $X_i$'s from a pdf f(x). Then

$$g(y_1, y_2, \ldots, y_n) = n!\, f(y_1) \cdot f(y_2) \cdot \cdots \cdot f(y_n) \qquad y_1 < y_2 < \cdots < y_n$$

For example, if we have a random sample of component lifetimes and the lifetime distribution is exponential with parameter $\lambda$, then the joint pdf of the order statistics is

$$g(y_1, \ldots, y_n) = n!\, \lambda^n e^{-\lambda(y_1 + \cdots + y_n)} \qquad 0 < y_1 < y_2 < \cdots < y_n < \infty$$


Example 5.42 Suppose X,, X2, X3, and X4 are independent random variables, each uniformly 
distributed on the interval from 0 to 1. The joint pdf of the four corresponding order statistics Y,, Yo, 
Y3, and Y, is ftv, y2, 3, ya) = 4!-1 for 0 < yy < yo < y3 < y4 < 1. The probability that every pair of 
X;’s is separated by more than .2 is the same as the probability that Yy — Y,; > .2, Y3 — Y2 > .2, and 
Y, — Y3 > .2. This latter probability results from integrating the joint pdf of the Y;’s over the region 
6<yqa< 1, .4<y3<yq- .2,.2<y. <3 —- .2,0<y) <2 — 22: 


1 y4—.2 y3—.2 y2—.2 


P(¥,2-—Y, > .2,¥3-—Yo > .2,Ys-—Y3 > .2) = / ‘f / Aldydy2dy3dy4 
6 4 2: 0 


The inner integration gives 4!(y. —.2), and this must then be integrated between .2 and y3 — .2. 
Making the change of variable z2 = y2 — .2, the integration of z2 is from 0 to y3 — .4. The result of 
this integration is 4!-(y; — .4)?/2. Continuing with the 3rd and 4th integration, each time making an 
appropriate change of variable so that the lower limit of each integration becomes 0, the result is 


P(¥)—Y, > .2,¥3—Y2 > .2,¥,—Y3 > .2) =.4* = 0256 


A more general multiple integration argument for n independent uniform [0, B] rvs shows that the 
probability that all values are separated by at least d is 
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P(all values are separated by more than d) = { [1 — " 1)d/B) er aan 1) 

As an application, consider a year that has 365 days, and suppose that the birth time of someone born
in that year is uniformly distributed throughout the 365-day period. Then in a group of 10 independently
selected people born in that year, the probability that all of their birth times are separated by
more than 24 h ($d = 1$ day) is $(1 - 9/365)^{10} = .779$. Thus the probability that at least two of the 10
birth times are separated by at most 24 h is .221. As the group size $n$ increases, it becomes more likely
that at least two people have birth times that are within 24 h of each other (but not necessarily on the
same day). For $n = 16$, this probability is .489, and for $n = 17$ it is .533. So with as few as 17 people
in the group, it is more likely than not that at least two of the people were born within 24 h of each
other. Coincidences such as this are not as surprising as one might think. The probability that at least
two people are born on the same day (assuming equally likely birthdays) is much easier to calculate
than what we have shown here; see The Birthday Problem in Example 2.22.
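
The separation probability above is easy to evaluate numerically. The following is a minimal Python sketch, not taken from the text: the function names `prob_all_separated` and `simulate_all_separated` are hypothetical, and the simulation (with an arbitrary seed and number of trials) is included only as a rough check on the closed-form expression.

```python
import random

def prob_all_separated(n, d, B):
    """P(all n independent Uniform[0, B] values are pairwise more than d apart), n >= 2."""
    if d > B / (n - 1):
        return 0.0
    return (1 - (n - 1) * d / B) ** n

def simulate_all_separated(n, d, B, trials=100_000, seed=1):
    """Monte Carlo estimate of the same probability (illustrative check only)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ys = sorted(rng.uniform(0, B) for _ in range(n))
        # Every pair is separated by more than d iff every adjacent
        # pair of order statistics is separated by more than d.
        if all(b - a > d for a, b in zip(ys, ys[1:])):
            hits += 1
    return hits / trials

print(round(prob_all_separated(4, 0.2, 1), 4))       # Example 5.42: 0.0256
print(round(prob_all_separated(10, 1, 365), 3))      # ten birth times: 0.779
print(round(1 - prob_all_separated(17, 1, 365), 3))  # seventeen people: 0.533
print(simulate_all_separated(4, 0.2, 1))             # should be close to 0.0256
```

The simulation works with the sorted values because checking adjacent order statistics is equivalent to checking all pairs, which mirrors the argument in Example 5.42.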


The Joint Distribution of Two Order Statistics 

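Finally, we consider the joint distribution of two order statistics $Y_i$ and $Y_j$ with $i < j$. Consider first
$n = 6$ and the two order statistics $Y_3$ and $Y_5$. We must then take the joint pdf of all six order statistics,
hold $y_3$ and $y_5$ fixed, and integrate out $y_1$, $y_2$, $y_4$, and $y_6$. That is,


$$g_{3,5}(y_3, y_5) = \int_{y_5}^{\infty} \int_{y_3}^{y_5} \int_{-\infty}^{y_3} \int_{y_1}^{y_3} 6!\, f(y_1) f(y_2) \cdots f(y_6)\, dy_2\, dy_1\, dy_4\, dy_6$$


The result of this integration is


$$g_{3,5}(y_3, y_5) = \frac{6!}{2!\,1!\,1!}\, F(y_3)^2\, [F(y_5) - F(y_3)]^1\, [1 - F(y_5)]^1\, f(y_3) f(y_5) \qquad -\infty < y_3 < y_5 < \infty$$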


The foregoing derivation generalizes as follows. 


PROPOSITION Let $g_{i,j}(y_i, y_j)$ denote the joint pdf of the order statistics $Y_i$ and $Y_j$, $i < j$, resulting
from a random sample of $X_i$'s from a pdf $f(x)$. Then


$$g_{i,j}(y_i, y_j) = \frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\, F(y_i)^{i-1}\, [F(y_j) - F(y_i)]^{j-i-1}\, [1 - F(y_j)]^{n-j}\, f(y_i) f(y_j)$$


for $-\infty < y_i < y_j < \infty$


This joint pdf can be “derived” intuitively by considering a multinomial probability similar to the
argument presented for the marginal pdf of $Y_i$. In this case, there are five relevant intervals: $(-\infty, y_i]$,
$(y_i, y_i + \Delta_1]$, $(y_i + \Delta_1, y_j]$, $(y_j, y_j + \Delta_2]$, and $(y_j + \Delta_2, \infty)$.
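
One way to gain confidence in the proposition is to check numerically that the formula integrates to 1 in a simple case. The sketch below assumes a Uniform(0, 1) sample, so that $F(y) = y$ and $f(y) = 1$; the function names and the grid size `m` are illustrative choices, and a plain midpoint rule stands in for exact integration.

```python
from math import factorial

def joint_pdf_uniform(u, v, i, j, n):
    """Joint pdf g_{i,j}(u, v) of Y_i and Y_j for a Uniform(0, 1) sample of size n."""
    if not (0 < u < v < 1):
        return 0.0
    const = factorial(n) / (factorial(i - 1) * factorial(j - i - 1) * factorial(n - j))
    # With F(y) = y and f(y) = 1 the general formula reduces to this product.
    return const * u ** (i - 1) * (v - u) ** (j - i - 1) * (1 - v) ** (n - j)

def total_mass(i, j, n, m=200):
    """Midpoint-rule approximation to the integral of g_{i,j} over the unit square."""
    h = 1.0 / m
    total = 0.0
    for a in range(m):
        u = (a + 0.5) * h
        for b in range(m):
            v = (b + 0.5) * h
            total += joint_pdf_uniform(u, v, i, j, n)
    return total * h * h

print(total_mass(3, 5, 6))  # should be very close to 1
```

For $n = 6$, $i = 3$, $j = 5$ the constant is $6!/(2!\,1!\,1!) = 360$, matching the special case derived before the proposition.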
Exercises: Section 5.7 (109-121)


109. A friend of ours takes the bus five days per week to her job. The five waiting times until she can board the bus are a random sample from a uniform distribution on the interval from 0 to 10 min.

a. Determine the pdf and then the expected value of the largest of the five waiting times.
b. Determine the expected value of the difference between the largest and smallest times.
c. What is the expected value of the sample median waiting time?
d. What is the standard deviation of the largest time?


110. Refer back to Example 5.40. Because $n = 4$, the sample median is $(Y_2 + Y_3)/2$. What is the expected value of the sample median, and how does it compare to the median of the population distribution?


111. Referring back to Exercise 109, suppose you learn that the smallest of the five waiting times is 4 min. What is the conditional density function of the largest waiting time, and what is the expected value of the largest waiting time in light of this information?


112. Let $X$ represent a measurement error. It is natural to assume that the pdf $f(x)$ is symmetric about 0, so that the density at a value $-c$ is the same as the density at $c$ (an error of a given magnitude is equally likely to be positive or negative). Consider a random sample of $n$ measurements, where $n = 2k + 1$, so that $Y_{k+1}$ is the sample median. What can be said about $E(Y_{k+1})$? If the $X$ distribution is symmetric about some other value, so that value is the median of the distribution, what does this imply about $E(Y_{k+1})$? [Hints: For the first question, symmetry implies that $1 - F(x) = P(X > x) = P(X < -x) = F(-x)$. For the second question, consider $W = X - \tilde{\mu}$; what is the median of the distribution of $W$?]


113. A store is expecting $n$ deliveries between the hours of noon and 1 p.m. Suppose the arrival time of each delivery truck is uniformly distributed on this one-hour interval and that the times are independent of each other. What are the expected values of the ordered arrival times?


114. The pdf of the second-largest order statistic, $Y_{n-1}$, can be obtained using reasoning analogous to how the pdf of $Y_n$ was first obtained.

a. For any number $y$, $Y_{n-1} \le y$ if and only if at least $n - 1$ of the original $X$'s are $\le y$. (Do you see why?) Use this fact to derive a formula for the cdf of $Y_{n-1}$ in terms of $F$, the cdf of the $X$'s. [Hint: Separate “at least $n - 1$” into two cases and apply the binomial formula.]
b. Differentiate part (a) to obtain the pdf of $Y_{n-1}$. Simplify and verify it matches the formula for $g_{n-1}(y)$ provided in this section.


115. Let $X$ be the amount of time an ATM is in use during a particular one-hour period, and suppose that $X$ has the cdf $F(x) = x^{\theta}$ for $0 \le x \le 1$ (where $\theta > 1$). Give expressions involving the gamma function for both the mean and variance of the $i$th smallest amount of time $Y_i$ from a random sample of $n$ such time periods.


116. The logistic pdf $f(x) = e^{-x}/(1 + e^{-x})^2$ for $-\infty < x < \infty$ is sometimes used to describe the distribution of measurement errors.

a. Graph the pdf. Does the appearance of the graph surprise you?
b. For a random sample of size $n$, obtain an expression involving the gamma function for the moment generating function of the $i$th smallest order statistic $Y_i$. This expression can then be differentiated to obtain moments of the order statistics. [Hint: Set up the appropriate integral, and then let $u = 1/(1 + e^{-x})$.]
117. An insurance policy issued to a boat owner has a deductible amount of $1000, so the amount of damage claimed must exceed this deductible before there will be a payout. Suppose the amount (1000s of dollars) of a randomly selected claim is a continuous rv with pdf $f(x) = 3/x^4$ for $x > 1$. Consider a random sample of three claims.

a. What is the probability that at least one of the claim amounts exceeds $5000?
b. What is the expected value of the largest amount claimed?


118. Conjecture the form of the joint pdf of three order statistics $Y_i$, $Y_j$, $Y_k$ ($i < j < k$) in a random sample of size $n$.


119. Use the intuitive argument sketched in this section to obtain the general formula for the joint pdf of two order statistics given in the last proposition.


120. Consider a sample of size $n = 3$ from the standard normal distribution, and obtain the expected value of the largest order statistic. What does this say about the expected value of the largest order statistic in a sample of this size from any normal distribution? [Hint: With $\phi(x)$ denoting the standard normal pdf, use the fact that $(d/dx)\phi(x) = -x\phi(x)$ along with integration by parts.]


121. Let $Y_1$ and $Y_n$ be the smallest and largest order statistics, respectively, from a random sample of size $n$.

a. Use the last proposition in this section to determine the joint pdf of $Y_1$ and $Y_n$. (Your answer will include the pdf $f$ and cdf $F$ of the original random sample.)
b. Let $W_1 = Y_1$ and $W_2 = Y_n - Y_1$ (the latter is the sample range). Use the method of Section 5.6 to obtain the joint pdf of $W_1$ and $W_2$, and then derive an expression involving an integral for the pdf of the sample range.
c. For the case in which the random sample is from a uniform distribution on [0, 1], carry out the integration of (b) to obtain an explicit formula for the pdf of the sample range. [Hint: For the Uniform[0, 1] distribution, what are $f$ and $F$?]


Supplementary Exercises: (122-150) 


122. Suppose the amount of rainfall in one region during a particular month has an exponential distribution with mean value 3 in., the amount of rainfall in a second region during that same month has an exponential distribution with mean value 2 in., and the two amounts are independent of each other. What is the probability that the second region gets more rainfall during this month than does the first region?


123. Two messages are to be sent. The time (min) necessary to send each message has an exponential distribution with parameter $\lambda = 1$, and the two times are independent of each other. It costs $2 per minute to send the first message and $1 per minute to send the second. Obtain the density function of the total cost of sending the two messages. [Hint: First obtain the cumulative distribution function of the total cost, which involves integrating the joint pdf.]


124. A restaurant serves three fixed-price dinners costing $25, $35, and $50. For a randomly selected couple dining at this restaurant, let $X$ = the cost of the man’s dinner and $Y$ = the cost of the woman’s dinner. The joint pmf of $X$ and $Y$ is given in the following table:

                   y
p(x, y)       25    35    50
       25    .05   .05   .10
x      35    .05   .10   .35
       50      0   .20   .10

a. Compute the marginal pmfs of $X$ and $Y$.
b. What is the probability that the man’s and the woman’s dinner cost at most $35 each?
c. Are $X$ and $Y$ independent? Justify your answer.
d. What is the expected total cost of the dinner for the two people?
e. Suppose that when a couple opens fortune cookies at the conclusion of the meal, they find the message “You will receive as a refund the difference between the cost of the more expensive and the less expensive meal that you have chosen.” How much does the restaurant expect to refund?


125. A health-food store stocks two different brands of a type of grain. Let $X$ = the amount (lb) of brand A on hand and $Y$ = the amount of brand B on hand. Suppose the joint pdf of $X$ and $Y$ is

$$f(x, y) = kxy \qquad x \ge 0,\ y \ge 0,\ 20 \le x + y \le 30$$

a. Draw the region of positive density and determine the value of $k$.
b. Are $X$ and $Y$ independent? Answer by first deriving the marginal pdf of each variable.
c. Compute $P(X + Y \le 25)$.
d. What is the expected total amount of this grain on hand?
e. Compute Cov($X$, $Y$) and Corr($X$, $Y$).
f. What is the variance of the total amount of grain on hand?


126. Let $X_1, X_2, \ldots, X_n$ be random variables denoting $n$ independent bids for an item that is for sale. Suppose each $X_i$ is uniformly distributed on the interval [100, 200]. If the seller sells to the highest bidder, how much can he expect to earn on the sale? [Hint: Let $Y = \max(X_1, X_2, \ldots, X_n)$. Find $F_Y(y)$ by using the results of Section 5.7 or else by noting that $Y \le y$ iff each $X_i$ is $\le y$. Then obtain the pdf and $E(Y)$.]


127. Suppose a randomly chosen individual’s verbal score $X$ and quantitative score $Y$ on a nationally administered aptitude examination have joint pdf

$$f(x, y) = \frac{2}{5}(2x + 3y) \qquad 0 \le x \le 1,\ 0 \le y \le 1$$

You are asked to provide a prediction $t$ of the individual’s total score $X + Y$. The error of prediction is the mean squared error $E[(X + Y - t)^2]$. What value of $t$ minimizes the error of prediction?


128. Let $X_1$ and $X_2$ be quantitative and verbal scores on one aptitude exam, and let $Y_1$ and $Y_2$ be corresponding scores on another exam. If Cov($X_1$, $Y_1$) = 5, Cov($X_1$, $Y_2$) = 1, Cov($X_2$, $Y_1$) = 2, and Cov($X_2$, $Y_2$) = 8, what is the covariance between the two total scores $X_1 + X_2$ and $Y_1 + Y_2$?


129. Let $Z_1$ and $Z_2$ be independent standard normal rvs and let

$$U = Z_1, \qquad V = \rho \cdot Z_1 + \sqrt{1 - \rho^2} \cdot Z_2$$

a. By definition, $U$ has mean 0 and standard deviation 1. Show that the same is true for $V$.
b. Use the properties of covariance to show that Cov($U$, $V$) = $\rho$.
c. Show that Corr($U$, $V$) = $\rho$.


130. You are driving on a highway at speed $X_1$. Cars entering this highway after you travel at speeds $X_2, X_3, \ldots$. Suppose these $X_i$’s are independent and identically distributed with pdf $f(x)$ and cdf $F(x)$. Unfortunately there is no way for a faster car to pass a slower one: it will catch up to the slower one and then travel at the same speed. For example, if $X_1 = 52.3$, $X_2 = 37.5$, and $X_3 = 42.8$, then no car will catch up to yours, but the third car will catch up to the second. Let $N$ = the number of cars that ultimately travel at your speed (in your “cohort”), including your own car. Possible values of $N$ are 1, 2, 3, .... Show that the pmf of $N$ is $p(n) = 1/[n(n + 1)]$, and then determine the expected number of cars in your cohort. [Hint: $N = 3$ requires that $X_1 < X_2$, $X_1 < X_3$, $X_4 < X_1$.]


131. Suppose the number of children born to an individual has pmf $p(x)$. A Galton–Watson branching process unfolds as follows: At time $t = 0$, the population consists of a single individual. Just prior to time $t = 1$, this individual gives birth to $X_1$ individuals according to the pmf $p(x)$, so there are $X_1$ individuals in the first generation. Just prior to time $t = 2$, each of these $X_1$ individuals gives birth independently of the others according to the pmf $p(x)$, resulting in $X_2$ individuals in the second generation (e.g., if $X_1 = 3$, then $X_2 = Y_1 + Y_2 + Y_3$, where $Y_i$ is the number of progeny of the $i$th individual in the first generation). This process then continues to yield a third generation of size $X_3$, and so on.

a. If $X_1 = 3$, $Y_1 = 4$, $Y_2 = 0$, $Y_3 = 1$, draw a tree diagram with two generations of branches to represent this situation.
b. Let $A$ be the event that the process ultimately becomes extinct (one way for $A$ to occur would be to have $X_1 = 3$ with none of these three second-generation individuals having any progeny) and let $p^* = P(A)$. Argue that $p^*$ satisfies the equation

$$p^* = \sum_x (p^*)^x \cdot p(x)$$

That is, $p^* = \psi(p^*)$ where $\psi(s)$ is the probability generating function introduced in Exercise 166 from Chapter 3. [Hint: $A = \bigcup_x (A \cap \{X_1 = x\})$, so the Law of Total Probability can be applied. Now given that $X_1 = 3$, $A$ will occur if and only if each of the three separate branching processes starting from the first generation ultimately becomes extinct; what is the probability of this happening?]
c. Verify that one solution to the equation in (b) is $p^* = 1$. It can be shown that this equation has just one other solution, and that the probability of ultimate extinction is in fact the smaller of the two roots. If $p(0) = .3$, $p(1) = .5$, and $p(2) = .2$, what is $p^*$? Is this consistent with the value of $\mu$, the expected number of progeny from a single individual? What happens if $p(0) = .2$, $p(1) = .5$, and $p(2) = .3$?


132. Let $f(x)$ and $g(y)$ be pdfs with corresponding cdfs $F(x)$ and $G(y)$, respectively. With $c$ denoting a numerical constant satisfying $|c| \le 1$, consider

$$f(x, y) = f(x)g(y)\{1 + c[2F(x) - 1][2G(y) - 1]\}$$

a. Show that $f(x, y)$ satisfies the conditions necessary to specify a joint pdf for two continuous rvs.
b. What is the marginal pdf of the first variable $X$? Of the second variable $Y$?
c. For what values of $c$ are $X$ and $Y$ independent?
d. If $f(x)$ and $g(y)$ are normal pdfs, is the joint distribution of $X$ and $Y$ bivariate normal?

133. The joint cumulative distribution function of two random variables $X$ and $Y$, denoted by $F(x, y)$, is defined by

$$F(x, y) = P[(X \le x) \cap (Y \le y)] \qquad -\infty < x < \infty,\ -\infty < y < \infty$$

a. Suppose that $X$ and $Y$ are both continuous variables. Once the joint cdf is available, explain how it can be used to determine the probability $P[(X, Y) \in A]$, where $A$ is the rectangular region $\{(x, y): a \le x \le b,\ c \le y \le d\}$.
b. Suppose the only possible values of $X$ and $Y$ are 0, 1, 2, ... and consider the values $a = 5$, $b = 10$, $c = 2$, and $d = 6$ for the rectangle specified in (a). Describe how you would use the joint cdf to calculate the probability that the pair $(X, Y)$ falls in the rectangle. More generally, how can the rectangular probability be calculated from the joint cdf if $a$, $b$, $c$, and $d$ are all integers?
c. Determine the joint cdf for the scenario of Example 5.1. [Hint: First determine $F(x, y)$ for $x$ = 100, 250 and $y$ = 0, 100, and 200. Then describe the joint cdf for various other $(x, y)$ pairs.]
d. Determine the joint cdf for the scenario of Example 5.3 and use it to calculate the probability that $X$ and $Y$ are both between .25 and .75. [Hint: For $0 \le x \le 1$ and $0 \le y \le 1$, $F(x, y) = \int_0^x \int_0^y f(u, v)\, dv\, du$.]
e. Determine the joint cdf for the scenario of Example 5.5. [Hint: Proceed as in (d), but be careful about the order of integration and consider separately $(x, y)$ points that lie inside the triangular region of positive density and then points that lie outside this region.]


134. A circular sampling region with radius X is
chosen by a biologist, where X has an 
exponential distribution with mean value 
10 ft. Plants of a certain type occur in this 
region according to a (spatial) Poisson 
process with “rate” .5 plant per square foot. 
Let Y denote the number of plants in the 
region. 


a. Find E(Y|X = x) and V(Y|X = x) 
b. Use part (a) to find E(Y). 
c. Use part (a) to find V(Y). 


135. The number of individuals arriving at a post
office to mail packages during a certain 
period is a Poisson random variable X with 
mean value 20. Independently of the others, 
any particular customer will mail either 1, 
2, 3, or 4 packages with probabilities .4, .3, 
.2, and .1, respectively. Let Y denote the 
total number of packages mailed during this 
time period. 


a. Find E(Y|X = x) and V(Y|X = x). 
b. Use part (a) to find E(Y). 
c. Use part (a) to find V(Y). 


136. Consider a sealed-bid auction in which each of the $n$ bidders has his/her valuation (assessment of inherent worth) of the item being auctioned. The valuation of any particular bidder is not known to the other bidders. Suppose these valuations constitute a random sample $X_1, \ldots, X_n$ from a distribution with cdf $F(x)$, with corresponding order statistics $Y_1 \le Y_2 \le \cdots \le Y_n$. The rent of the winning bidder is the difference between the winner’s valuation and the price. The article “Mean Sample Spacings, Sample Size and Variability in an Auction-Theoretic Framework” (Oper. Res. Lett. 2004: 103-108) argues that the rent is just $Y_n - Y_{n-1}$ (do you see why?).


a. Suppose that the valuation distribution 
is uniform on [0, 100]. What is the 
expected rent when there are n = 10 
bidders? 

b. Referring back to (a), what happens 
when there are 11 bidders? More gen- 
erally, what is the relationship between 
the expected rent for n bidders and for 
n + 1 bidders? Is this intuitive? [Note: 
The cited article presents a counterex- 
ample.] 


137. Suppose two identical components are
connected in parallel, so the system con- 
tinues to function as long as at least one of 
the components does so. The two lifetimes 
are independent of each other, each having 
an exponential distribution with mean 
1000 h. Let W denote system lifetime. 
Obtain the moment generating function of 
W, and use it to calculate the expected 
lifetime. 


138. Sandstone is mined from two different quarries. Let $X$ = the amount mined (in tons) from the first quarry each day and $Y$ = the amount mined (in tons) from the second quarry each day. The variables $X$ and $Y$ are independent, with $\mu_X = 12$, $\sigma_X = 4$, $\mu_Y = 10$, $\sigma_Y = 3$.


a. Find the mean and standard deviation of 
the variable X + Y, the total amount of 
sandstone mined in a day. 

b. Find the mean and standard deviation of 
the variable X — Y, the difference in the 
mines’ performances in a day. 

c. The manager of the first quarry sells
sandstone at $25/ton, while the manager
of the second quarry sells sandstone at
$28/ton. Find the mean and standard
deviation for the combined amount of
money the quarries generate in a day.

d. Assuming X and Y are both normally 
distributed, find the probability that the 
quarries generate more than $750 rev- 
enue in a day. 


139. In cost estimation, the total cost of a project is the sum of component task costs. Each of these costs is a random variable with a probability distribution. It is customary to obtain information about the total cost distribution by adding together characteristics of the individual component cost distributions; this is called the “roll-up” procedure. Since $E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n)$, the roll-up procedure is valid for mean cost. Suppose that there are two component tasks and that $X_1$ and $X_2$ are independent, normally distributed random variables. Is the roll-up procedure valid for the 75th percentile? That is, is the 75th percentile of the distribution of $X_1 + X_2$ the same as the sum of the 75th percentiles of the two individual distributions? If not, what is the relationship between the percentile of the sum and the sum of percentiles? For what percentiles is the roll-up procedure valid in this case?


140. Random sums. If $X_1, X_2, \ldots, X_n$ are independent rvs, each with the same mean value $\mu$ and variance $\sigma^2$, then the methods of Section 5.3 show that $E(X_1 + \cdots + X_n) = n\mu$ and $V(X_1 + \cdots + X_n) = n\sigma^2$. In some applications, the number of $X_i$’s under consideration is not a fixed number $n$ but instead a rv $N$. For example, let $N$ be the number of components of a certain type brought into a repair shop on a particular day and let $X_i$ represent the repair time for the $i$th component. Then the total repair time is $T_N = X_1 + X_2 + \cdots + X_N$, the sum of a random number of rvs.

a. Suppose that $N$ is independent of the $X_i$’s. Use the Law of Total Expectation to obtain an expression for $E(T_N)$ in terms of $\mu$ and $E(N)$.
b. Use the Law of Total Variance to obtain an expression for $V(T_N)$ in terms of $\mu$, $\sigma^2$, $E(N)$, and $V(N)$.
c. Customers submit orders for stock purchases at a certain online site according to a Poisson process with a rate of 3 per hour. The amount purchased by any particular customer (in thousands of dollars) has an exponential distribution with mean 30, and purchase amounts are independent of the number of customers. What is the expected total amount ($) purchased during a particular 4-h period, and what is the standard deviation of this total amount?


141. The mean weight of luggage checked by a randomly selected tourist-class passenger flying between two cities on a certain airline is 40 lb, and the standard deviation is 10 lb. The mean and standard deviation for a business-class passenger are 30 lb and 6 lb, respectively.

a. If there are 12 business-class passengers and 50 tourist-class passengers on a particular flight, what are the expected value of total luggage weight and the standard deviation of total luggage weight?
b. If individual luggage weights are independent, normally distributed rvs, what is the probability that total luggage weight is at most 2500 lb?


142. The amount of soft drink that Ann consumes on any given day is independent of consumption on any other day and is normally distributed with $\mu = 13$ oz and $\sigma = 2$. If she currently has two six-packs of 16-oz bottles, what is the probability that she still has some soft drink left at the end of 2 weeks (14 days)? Why should we worry about the validity of the independence assumption here?


143. A student has a class that is supposed to end at 9:00 a.m. and another that is supposed to begin at 9:10 a.m. Suppose the actual ending time of the 9 a.m. class is a normally distributed rv $X_1$ with mean 9:02 and standard deviation 1.5 min and that the starting time of the next class is also a normally distributed rv $X_2$ with mean 9:10 and standard deviation 1 min. Suppose also that the time necessary to get from one classroom to the other is a normally distributed rv $X_3$ with mean 6 min and standard deviation 1 min. Assuming independence of $X_1$, $X_2$, and $X_3$, what is the probability that the student makes it to the second class before the lecture starts? Why should we worry about the reasonableness of the independence assumption here?


144. This exercise provides an alternative approach to establishing the properties of correlation.

a. Use the general formula for the variance of a linear combination to write an expression for $V(aX + Y)$. Then let $a = \sigma_Y/\sigma_X$, and show that $\rho \ge -1$. [Hint: Variance is always $\ge 0$, and Cov($X$, $Y$) = $\sigma_X \cdot \sigma_Y \cdot \rho$.]
b. By considering $V(aX - Y)$, conclude that $\rho \le 1$.
c. Use the fact that $V(W) = 0$ only if $W$ is a constant to show that $\rho = 1$ only if $Y = aX + b$.


145. A rock specimen from a particular area is randomly selected and weighed two different times. Let $W$ denote the actual weight and $X_1$ and $X_2$ the two measured weights. Then $X_1 = W + E_1$ and $X_2 = W + E_2$, where $E_1$ and $E_2$ are the two measurement errors. Suppose that the $E_i$’s are independent of each other and of $W$ and that $V(E_1) = V(E_2) = \sigma_E^2$.

a. Express $\rho$, the correlation coefficient between the two measured weights $X_1$ and $X_2$, in terms of $\sigma_W^2$, the variance of actual weight, and $\sigma_X^2$, the variance of measured weight.
b. Compute $\rho$ when $\sigma_W = 1$ kg and $\sigma_E = .01$ kg.
146. Let $A$ denote the percentage of one constituent in a randomly selected rock specimen, and let $B$ denote the percentage of a second constituent in that same specimen. Suppose $D$ and $E$ are measurement errors in determining the values of $A$ and $B$ so that measured values are $X = A + D$ and $Y = B + E$, respectively. Assume that measurement errors are independent of each other and of actual values.

a. Show that

$$\text{Corr}(X, Y) = \text{Corr}(A, B) \cdot \sqrt{\text{Corr}(X_1, X_2)} \cdot \sqrt{\text{Corr}(Y_1, Y_2)}$$

where $X_1$ and $X_2$ are replicate measurements on the value of $A$, and $Y_1$ and $Y_2$ are defined analogously with respect to $B$. What effect does the presence of measurement error have on the correlation?
b. What is the maximum value of Corr($X$, $Y$) when Corr($X_1$, $X_2$) = .8100, Corr($Y_1$, $Y_2$) = .9025? Is this disturbing?


147. Let $X_1, \ldots, X_n$ be independent rvs with mean values $\mu_1, \ldots, \mu_n$ and variances $\sigma_1^2, \ldots, \sigma_n^2$. Consider a function $h(x_1, \ldots, x_n)$, and use it to define a new random variable $Y = h(X_1, \ldots, X_n)$. Under rather general conditions on the $h$ function, if the $\sigma_i$’s are all small relative to the corresponding $\mu_i$’s, it can be shown that $E(Y) \approx h(\mu_1, \ldots, \mu_n)$ and

$$V(Y) \approx \left(\frac{\partial h}{\partial x_1}\right)^2 \sigma_1^2 + \cdots + \left(\frac{\partial h}{\partial x_n}\right)^2 \sigma_n^2$$

where each partial derivative is evaluated at $(x_1, \ldots, x_n) = (\mu_1, \ldots, \mu_n)$. Suppose three resistors with resistances $X_1$, $X_2$, $X_3$ are connected in parallel across a battery with voltage $X_4$. Then by Ohm’s law, the current is

$$Y = X_4\left[\frac{1}{X_1} + \frac{1}{X_2} + \frac{1}{X_3}\right]$$

Let $\mu_1 = 10\ \Omega$, $\sigma_1 = 1.0\ \Omega$, $\mu_2 = 15\ \Omega$, $\sigma_2 = 1.0\ \Omega$, $\mu_3 = 20\ \Omega$, $\sigma_3 = 1.5\ \Omega$, $\mu_4 = 120$ V, $\sigma_4 = 4.0$ V. Calculate the approximate expected value and standard deviation of the current (suggested by “Random Samplings,” CHEMTECH 1984: 696-697).


148. A more accurate approximation to $E[h(X_1, \ldots, X_n)]$ in the previous exercise is

$$h(\mu_1, \ldots, \mu_n) + \frac{1}{2}\sigma_1^2\left(\frac{\partial^2 h}{\partial x_1^2}\right) + \cdots + \frac{1}{2}\sigma_n^2\left(\frac{\partial^2 h}{\partial x_n^2}\right)$$

Compute this for $Y = h(X_1, X_2, X_3, X_4)$ given in the previous exercise, and compare it to the leading term $h(\mu_1, \ldots, \mu_n)$.


149. The following example is based on “Conditional Moments and Independence” (The American Statistician 2008: 219). Consider the following joint pdf of two rvs $X$ and $Y$:

$$f(x, y) = \frac{e^{-[(\ln x)^2 + (\ln y)^2]/2}}{2\pi x y}\,[1 + \sin(2\pi \ln x)\sin(2\pi \ln y)] \qquad \text{for } x > 0,\ y > 0$$

a. Show that the marginal distribution of each rv is lognormal. [Hint: When obtaining the marginal pdf of $X$, make the change of variable $u = \ln(y)$.]
b. Obtain the conditional pdf of $Y$ given that $X = x$. Then show that for every positive integer $n$, $E(Y^n \mid X = x) = E(Y^n)$. [Hint: Make the change of variable $\ln(y) = u + n$ in the second integrand.]
c. Redo (b) with $X$ and $Y$ interchanged.
d. The results of (b) and (c) suggest intuitively that $X$ and $Y$ are independent rvs. Are they in fact independent?


150. Let $Y_0$ denote the initial price of a particular security and $Y_n$ denote the price at the end of $n$ additional weeks for $n = 1, 2, 3, \ldots$. Assume that the successive price ratios $Y_1/Y_0$, $Y_2/Y_1$, $Y_3/Y_2$, ... are independent of one another and that each ratio has a lognormal distribution with $\mu = .4$ and $\sigma = .8$ (the assumptions of independence and lognormality are common in such scenarios).

a. Calculate the probability that the security price will increase over the course of a week.
b. Calculate the probability that the security price will be higher at the end of the next week, be lower the week after that, and then be higher again at the end of the following week. [Hint: What does “higher” say about the ratio $Y_{i+1}/Y_i$?]
c. Calculate the probability that the security price will have increased by at least 20% over the course of a five-week period. [Hint: Consider the ratio $Y_5/Y_0$, and write this in terms of successive ratios $Y_{i+1}/Y_i$.]




Introduction 


This chapter helps make the transition between probability and inferential statistics. Given a sample of 
n observations from a population, we will be calculating estimates of the population mean, median, 
standard deviation, and various other population characteristics (parameters). Prior to obtaining data, 
there is uncertainty as to which of all possible samples will occur. Because of this, estimates such 
as $\bar{x}$, $\tilde{x}$, and $s$ will vary from one sample to another. The behavior of such estimates in repeated sampling
is described by what are called sampling distributions. Any particular sampling distribution will give an 
indication of how close the estimate is likely to be to the value of the parameter being estimated. 

The first two sections use probability results to study sampling distributions. A particularly 
important result is the Central Limit Theorem, which shows how the behavior of the sample mean can 
be described by a normal distribution when the sample size is large. The last two sections introduce 
several distributions related to samples from a normal population distribution. Many inferential 
procedures are based on properties of these sampling distributions. 


6.1. Statistics and Their Distributions 


The observations in a single sample were denoted in Chapter 1 by $x_1, x_2, \ldots, x_n$. Consider selecting
two different samples of size $n$ from the same population distribution. The $x_i$’s in the second sample
will virtually always differ at least a bit from those in the first sample. For example, a first sample of
$n = 3$ cars of a particular model might result in fuel efficiencies $x_1 = 30.7$, $x_2 = 29.4$, $x_3 = 31.1$,
whereas a second sample may give $x_1 = 28.8$, $x_2 = 30.0$, and $x_3 = 31.1$. Before we obtain data, there
is uncertainty about the value of each $x_i$. Because of this uncertainty, before the data becomes
available we view each observation as a random variable and denote the sample by $X_1, X_2, \ldots, X_n$
(uppercase letters for random variables).

This variation in observed values in turn implies that the value of any function of the sample 
observations—such as the sample mean, sample standard deviation, or sample iqr—also varies from 
sample to sample. That is, prior to obtaining $x_1, \ldots, x_n$, there is uncertainty as to the value of $\bar{x}$, the
value of s, and so on. 


Example 6.1 Suppose that material strength for a randomly selected specimen of a particular type
has a Weibull distribution with parameter values $\alpha = 2$ (shape) and $\beta = 5$ (scale). The corresponding
density curve is shown in Figure 6.1. Formulas from Section 4.5 give

Figure 6.1 The Weibull density curve for Example 6.1


$$\mu = E(X) = 4.4311 \qquad \tilde{\mu} = 4.1628 \qquad \sigma^2 = V(X) = 5.365 \qquad \sigma = 2.316$$


The mean exceeds the median because of the distribution’s positive skew. 
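
The population characteristics quoted above can be reproduced from the standard Weibull formulas (mean $\beta\,\Gamma(1 + 1/\alpha)$, median $\beta(\ln 2)^{1/\alpha}$, variance $\beta^2[\Gamma(1 + 2/\alpha) - \Gamma(1 + 1/\alpha)^2]$). The short Python sketch below is an illustration only; the variable names are arbitrary.

```python
from math import gamma, log, sqrt

shape, scale = 2.0, 5.0  # alpha and beta from Example 6.1

mean = scale * gamma(1 + 1 / shape)
median = scale * log(2) ** (1 / shape)
variance = scale ** 2 * (gamma(1 + 2 / shape) - gamma(1 + 1 / shape) ** 2)
sd = sqrt(variance)

print(round(mean, 4), round(median, 4), round(variance, 3), round(sd, 3))
# 4.4311 4.1628 5.365 2.316
```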

We used statistical software to generate six different samples, each with n = 10, from this
distribution (material strengths for six different groups of ten specimens each). The results appear in
Table 6.1, followed by the values of the sample mean, sample median, and sample standard deviation 
for each sample. Notice first that the ten observations in any particular sample are all different from 
those in any other sample. Second, the six values of the sample mean are all different from each other, 
as are the six values of the sample median and the six values of the sample standard deviation. The 
same would be true of the sample 10% trimmed means, sample iqrs, and so on. 


Table 6.1 Samples from the Weibull distribution of Example 6.1 


                                      Sample
Observation         1         2         3         4         5         6

     1         6.1171   5.07611   3.46710   1.55601   3.12372   8.93795
     2         4.1600   6.79279   2.71938   4.56941   6.09685   3.92487
     3         3.1950   4.43259   5.88129   4.79870   3.41181   8.76202
     4         0.6694   8.55752   5.14915   2.49759   1.65409   7.05569
     5         1.8552   6.82487   4.99635   2.33267   2.29512   2.30932
     6         5.2316   7.39958   5.86887   4.01295   2.12583   5.94195
     7         2.7609   2.14755   6.05918   9.08845   3.20938   6.74166
     8        10.2185   8.50628   1.80119   3.25728   3.23209   1.75468
     9         5.2438   5.49510   4.21994   3.70132   6.84426   4.91827
    10         4.5590   4.04525   2.12934   5.50134   4.20694   7.26081

  Mean          4.401     5.928     4.229     4.132     3.620     5.761
  Median        4.360     6.144     4.608     3.857     3.221     6.342
  SD            2.642     2.062     1.611     2.124     1.678     2.496




Furthermore, the value of the sample mean from any particular sample can be regarded as a point
estimate (“point” because it is a single number, corresponding to a single point on the number line) of
the population mean \(\mu\), whose value is known to be 4.4311. None of the estimates from these six
samples is identical to what is being estimated. The estimates from the second and sixth samples are
much too large, whereas the fifth sample gives a substantial underestimate. Similarly, the sample
standard deviation gives a point estimate of the population standard deviation, \(\sigma = 2.316\). All six of
the resulting estimates are in error by at least a small amount.
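The experiment in Example 6.1 can be imitated with a short script. The following is a minimal sketch, assuming Python's standard library stands in for the unnamed statistical software; the seed is arbitrary, so the generated samples will not match Table 6.1. The population values are computed from the standard Weibull formulas (mean \(\beta\Gamma(1 + 1/\alpha)\), median \(\beta(\ln 2)^{1/\alpha}\)).

```python
import math
import random
import statistics

SHAPE, SCALE = 2.0, 5.0  # alpha (shape) and beta (scale) from Example 6.1

# Population characteristics of a Weibull(shape, scale) distribution
mu = SCALE * math.gamma(1 + 1 / SHAPE)              # mean
median = SCALE * math.log(2) ** (1 / SHAPE)         # median
var = SCALE**2 * math.gamma(1 + 2 / SHAPE) - mu**2  # variance
sigma = math.sqrt(var)                              # standard deviation

rng = random.Random(1)  # arbitrary seed, used only for reproducibility


def weibull_sample(n):
    # Note the argument order: random.weibullvariate(scale, shape)
    return [rng.weibullvariate(SCALE, SHAPE) for _ in range(n)]


# Six samples of size 10, each summarized by (mean, median, sd)
summaries = []
for _ in range(6):
    xs = weibull_sample(10)
    summaries.append(
        (statistics.mean(xs), statistics.median(xs), statistics.stdev(xs))
    )

print(f"population: mean={mu:.4f} median={median:.4f} sd={sigma:.3f}")
for j, (m, med, s) in enumerate(summaries, start=1):
    print(f"sample {j}: mean={m:.3f} median={med:.3f} sd={s:.3f}")
```

Each run with a different seed gives six different sets of summary values, none of which coincides exactly with the population quantities printed on the first line.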


In summary, the values of the individual sample observations vary from sample to sample, so in
general the value of any quantity computed from sample data also varies, and the value of a sample
characteristic used as an estimate of the corresponding population characteristic will virtually never
coincide with what is being estimated.


DEFINITION A statistic is any quantity whose value can be calculated from sample data. Prior 
to obtaining data, there is uncertainty as to what value of any particular statistic 
will result. Therefore, a statistic is a random variable and will be denoted by an 
uppercase letter; a lowercase letter is used to represent the calculated or observed 
value of the statistic. 


Thus the sample mean, regarded as a statistic (before a sample has been selected or an experiment has
been carried out), is denoted by \(\bar{X}\); the calculated value of this statistic from a particular sample
is \(\bar{x}\). Similarly, S represents the sample standard deviation thought of as a statistic, and its
computed value is s.

Any statistic, being a random variable, has a probability distribution. The probability distribution 
of any particular statistic depends not only on the population distribution (normal, uniform, etc.) and 
the sample size n but also on the method of sampling. Our next definition describes a sampling 
method often encountered, at least approximately, in practice. 


DEFINITION The rvs \(X_1, X_2, \ldots, X_n\) are said to form a (simple) random sample of size n if


1. The \(X_i\)'s are independent rvs.

2. Every \(X_i\) has the same probability distribution.

Such a collection of random variables is also referred to as being independent
and identically distributed (iid).


If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2
are satisfied exactly. These conditions will be approximately satisfied if sampling is without
replacement, yet the sample size n is much smaller than the population size N. In practice, if
\(n/N \le .05\) (at most 5% of the population is sampled), we can proceed as if the \(X_i\)'s form a random
sample. The virtue of this sampling method is that the probability distribution of any statistic can be
more easily obtained than for any other sampling method.

The probability distribution of a statistic is sometimes referred to as its sampling distribution to 
emphasize that it describes how the statistic varies in value across all samples that might be selected. 
There are two general methods for obtaining information about a statistic’s sampling distribution. One 
method involves calculations based on probability rules, and the other involves carrying out a 
simulation experiment. 

Deriving the Sampling Distribution of a Statistic

Probability rules can be used to obtain the distribution of a statistic provided that it is a “fairly simple” 
function of the \(X_i\)'s and either there are relatively few different X values in the population or else the
population distribution has a “nice” form. Our next two examples illustrate such situations. 


Example 6.2 An online florist offers three different sizes for Mother’s Day bouquets: a small
arrangement costing $80 (including shipping), a medium-sized one for $100, and a large one with a
price tag of $120. If 20% of all purchasers choose the small arrangement, 30% choose medium, and
50% choose large (because they really love Mom!), then the probability distribution of the cost of a
single randomly selected flower arrangement is given by

    x        80    100    120
    p(x)     .2     .3     .5        with \(\mu = 106\), \(\sigma^2 = 244\)        (6.1)

Suppose only two bouquets are sold today. Let \(X_1\) = the cost of the first bouquet and \(X_2\) = the cost of
the second. Suppose that \(X_1\) and \(X_2\) are independent, each with the probability distribution shown in
(6.1), so that \(X_1\) and \(X_2\) constitute a random sample from the distribution (6.1). Table 6.2 lists possible
\((x_1, x_2)\) pairs, the probability of each pair computed using (6.1) and the assumption of independence,
and the resulting \(\bar{x}\) and \(s^2\) values. (Note that when n = 2, \(s^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2\).)


Table 6.2 Outcomes, probabilities, and values of \(\bar{x}\) and \(s^2\) for Example 6.2

    x_1    x_2    p(x_1, x_2)         x̄      s²
     80     80    (.2)(.2) = .04      80       0
     80    100    (.2)(.3) = .06      90     200
     80    120    (.2)(.5) = .10     100     800
    100     80    (.3)(.2) = .06      90     200
    100    100    (.3)(.3) = .09     100       0
    100    120    (.3)(.5) = .15     110     200
    120     80    (.5)(.2) = .10     100     800
    120    100    (.5)(.3) = .15     110     200
    120    120    (.5)(.5) = .25     120       0


Now to obtain the probability distribution of \(\bar{X}\), the sample average cost per bouquet, we must
consider each possible value \(\bar{x}\) and compute its probability. For example, \(\bar{x} = 100\) occurs three times
in the table with probabilities .10, .09, and .10, so

\[ P(\bar{X} = 100) = .10 + .09 + .10 = .29 \]

Similarly, \(s^2 = 800\) appears twice in the table with probability .10 each time, so

\[ P(S^2 = 800) = P(X_1 = 80, X_2 = 120) + P(X_1 = 120, X_2 = 80) = .10 + .10 = .20 \]

The complete sampling distributions of \(\bar{X}\) and \(S^2\) appear in (6.2) and (6.3).

    x̄             80     90    100    110    120
    p(x̄)         .04    .12    .29    .30    .25        (6.2)

    s²              0    200    800
    p(s²)         .38    .42    .20                      (6.3)


Figure 6.2 depicts a probability histogram for both the original distribution of X (6.1) and the \(\bar{X}\)
distribution (6.2). The figure suggests first that the mean (i.e., expected value) of \(\bar{X}\) is equal to the
mean $106 of the original distribution, since both histograms appear to be centered at the same place.
Indeed, from (6.2),

\[ E(\bar{X}) = \sum \bar{x}\, p_{\bar{X}}(\bar{x}) = 80(.04) + \cdots + 120(.25) = 106 = \mu \]


Figure 6.2 Probability histograms for (a) the underlying population distribution
and (b) the sampling distribution of \(\bar{X}\) in Example 6.2


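Second, it appears that the \(\bar{X}\) distribution has smaller spread (variability) than the original
distribution, since the values of \(\bar{x}\) are more concentrated toward the mean. Again from (6.2),

\[
\begin{aligned}
V(\bar{X}) &= \sum (\bar{x} - \mu_{\bar{X}})^2\, p_{\bar{X}}(\bar{x}) = \sum (\bar{x} - 106)^2\, p_{\bar{X}}(\bar{x}) \\
&= (80 - 106)^2(.04) + \cdots + (120 - 106)^2(.25) = 122
\end{aligned}
\]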


Notice that \(V(\bar{X}) = 122 = 244/2 = \sigma^2/2\), exactly half the population variance; that is a consequence
of the sample size n = 2, and we’ll see why in the next section.

Finally, the mean value of \(S^2\) is

\[ E(S^2) = \sum s^2\, p_{S^2}(s^2) = 0(.38) + 200(.42) + 800(.20) = 244 = \sigma^2 \]

That is, the \(\bar{X}\) sampling distribution is centered at the population mean \(\mu\), and the \(S^2\) sampling
distribution (histogram not shown) is centered at the population variance \(\sigma^2\).


If four flower arrangements had been purchased on the day of interest, the sample average cost \(\bar{X}\)
would be based on a random sample of four \(X_i\)'s, each having the distribution (6.1). More calculation
eventually yields the distribution of \(\bar{X}\) for n = 4 as

    x̄          80      85      90      95     100     105     110     115     120
    p(x̄)    .0016   .0096   .0376   .0936   .1761   .2340   .2350   .1500   .0625

From this, \(E(\bar{X}) = 106 = \mu\) and \(V(\bar{X}) = 61 = \sigma^2/4\). Figure 6.3 is a probability histogram of this
distribution.


Figure 6.3 Probability histogram for \(\bar{X}\) based on n = 4 in Example 6.2
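The enumeration behind Table 6.2 and distributions (6.2) and (6.3) is mechanical enough to hand to a computer. The sketch below is one possible way to do it, assuming exact rational arithmetic via Python's `fractions` module; the function names (`sampling_distribution`, `expected`, `var_of`) are illustrative and not from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population distribution (6.1): bouquet cost -> probability
POP = {80: Fraction(2, 10), 100: Fraction(3, 10), 120: Fraction(5, 10)}


def sampling_distribution(statistic, n):
    """Exact pmf of statistic(sample) for an iid sample of size n from POP."""
    pmf = defaultdict(Fraction)
    for sample in product(POP, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= POP[x]  # independence: multiply the marginal probabilities
        pmf[statistic(sample)] += prob
    return dict(sorted(pmf.items()))


def mean(sample):
    return Fraction(sum(sample), len(sample))


def variance(sample):
    xbar = mean(sample)
    return sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)


def expected(pmf):
    return sum(v * p for v, p in pmf.items())


def var_of(pmf):
    m = expected(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())


xbar2 = sampling_distribution(mean, 2)      # distribution (6.2)
s2_2 = sampling_distribution(variance, 2)   # distribution (6.3)
xbar4 = sampling_distribution(mean, 4)      # the n = 4 distribution

print({float(v): float(p) for v, p in xbar2.items()})
print({float(v): float(p) for v, p in s2_2.items()})
print("n=2:", expected(xbar2), var_of(xbar2), expected(s2_2))
print("n=4:", expected(xbar4), var_of(xbar4))
```

The number of outcomes to enumerate is \(3^n\) here, which illustrates why exact enumeration quickly becomes impractical as n or the number of population values grows.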


Example 6.2 should suggest first of all that the computation of \(p_{\bar{X}}(\bar{x})\) and \(p_{S^2}(s^2)\) can be tedious. If
the original distribution (6.1) had allowed for more than the three possible values 80, 100, and 120,
then even for n = 2 the computations would have been more involved. The example should also
suggest, however, that there are some general relationships between \(E(\bar{X})\), \(V(\bar{X})\), \(E(S^2)\), and the mean
\(\mu\) and variance \(\sigma^2\) of the original distribution. These are stated in the next section. Now consider an
example in which the random sample is drawn from a continuous distribution.


Example 6.3 The time that it takes to serve a customer at the cash register in a minimarket is a
random variable having an exponential distribution with parameter \(\lambda\). Suppose \(X_1\) and \(X_2\) are service
times for two different customers, assumed independent of each other. Consider the total service time
\(T_o = X_1 + X_2\) for the two customers, also a statistic. The cdf of \(T_o\) is, for \(t \ge 0\),

\[
\begin{aligned}
F_{T_o}(t) = P(X_1 + X_2 \le t) &= \iint_{\{(x_1, x_2):\, x_1 + x_2 \le t\}} f(x_1, x_2)\, dx_1\, dx_2 \\
&= \int_0^t \int_0^{t - x_1} \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2}\, dx_2\, dx_1 \\
&= \int_0^t \left( \lambda e^{-\lambda x_1} - \lambda e^{-\lambda t} \right) dx_1 = 1 - e^{-\lambda t} - \lambda t e^{-\lambda t}
\end{aligned}
\]

The region of integration is pictured in Figure 6.4.

Figure 6.4 Region of integration to obtain cdf of \(T_o\) in Example 6.3

The pdf of \(T_o\) is obtained by differentiating \(F_{T_o}(t)\):

\[ f_{T_o}(t) = \lambda^2 t e^{-\lambda t} \qquad t > 0 \qquad (6.4) \]


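This is a gamma pdf (\(\alpha = 2\) and \(\beta = 1/\lambda\)). This distribution for \(T_o\) can also be derived by convolution
or by the moment generating function argument from Section 5.3.

Since \(F_{\bar{X}}(\bar{x}) = P(\bar{X} \le \bar{x}) = P(T_o \le 2\bar{x}) = F_{T_o}(2\bar{x})\), differentiating with respect to \(\bar{x}\) and using (6.4)
plus the chain rule gives us the pdf of \(\bar{X} = T_o/2\):

\[ f_{\bar{X}}(\bar{x}) = 4\lambda^2 \bar{x} e^{-2\lambda \bar{x}} \qquad \bar{x} > 0 \qquad (6.5) \]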


The mean and variance of the underlying exponential distribution are \(\mu = 1/\lambda\) and \(\sigma^2 = 1/\lambda^2\). Using
Expressions (6.4) and (6.5), it can be verified that \(E(\bar{X}) = 1/\lambda\), \(V(\bar{X}) = 1/(2\lambda^2)\), \(E(T_o) = 2/\lambda\), and
\(V(T_o) = 2/\lambda^2\). These results again suggest some general relationships between means and variances
of \(\bar{X}\), \(T_o\), and the underlying distribution.
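The derived cdf and the stated moments of \(T_o\) can be checked numerically. The following is a rough sketch, assuming a hypothetical rate \(\lambda = 0.5\) (the example leaves \(\lambda\) unspecified) and an arbitrary seed; it compares the empirical cdf of simulated totals with \(1 - e^{-\lambda t} - \lambda t e^{-\lambda t}\).

```python
import math
import random

LAM = 0.5    # hypothetical rate; the example does not fix a value of lambda
K = 100_000  # number of simulated pairs of customers
rng = random.Random(2)  # arbitrary seed

# Each total is the sum of two independent exponential service times
totals = [rng.expovariate(LAM) + rng.expovariate(LAM) for _ in range(K)]


def cdf_total(t, lam=LAM):
    """cdf of T_o = X_1 + X_2 as derived in Example 6.3."""
    return 1 - math.exp(-lam * t) - lam * t * math.exp(-lam * t)


def empirical_cdf(t):
    """Proportion of simulated totals that are at most t."""
    return sum(x <= t for x in totals) / K


for t in (1.0, 4.0, 8.0):
    print(f"t={t}: derived={cdf_total(t):.4f} simulated={empirical_cdf(t):.4f}")

mean_total = sum(totals) / K
var_total = sum((x - mean_total) ** 2 for x in totals) / (K - 1)
print(f"mean {mean_total:.3f} vs 2/lambda = {2 / LAM}")
print(f"variance {var_total:.3f} vs 2/lambda^2 = {2 / LAM**2}")
```

Agreement to two or three decimal places is what one should expect at this number of replications; it is a sanity check, not a proof.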


Simulation Experiments 

The second method of obtaining information about a statistic’s sampling distribution is to perform a 
simulation experiment. This method is often used when a derivation via probability rules or properties of 
distributions is too difficult or complicated to be carried out. Simulations are virtually always done with 
the aid of computer software. The following characteristics of a simulation experiment must be specified: 


1. The statistic of interest (\(\bar{X}\), S, a particular trimmed mean, etc.)

2. The population distribution (normal with \(\mu = 100\) and \(\sigma = 15\), uniform with lower limit A = 5 and
upper limit B = 10, etc.)

3. The sample size n (e.g., n = 10 or n = 50) 

4. The number of replications k (e.g., k = 10,000). 


Then use a computer to obtain k different random samples, each of size n, from the designated 
population distribution. For each such sample, calculate the value of the statistic and construct a 
histogram of the k calculated values. This histogram gives the approximate sampling distribution of 
the statistic. The larger the value of k, the better the approximation will tend to be (the actual sampling 
distribution emerges as \(k \to \infty\)). In practice, k = 10,000 may be enough for a “fairly simple” statistic
and population distribution, but modern computers allow for a much larger number of replications. 
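The four ingredients listed above map directly onto the arguments of a small helper function. This is a minimal sketch, not a definitive implementation: the names `simulate_sampling_distribution` and `text_histogram` are hypothetical, and the uniform population with A = 5 and B = 10 is taken from the list as an illustration.

```python
import random
import statistics


def simulate_sampling_distribution(statistic, draw, n, k, seed=0):
    """Return k simulated values of `statistic`, each from a fresh sample of size n.

    statistic: function mapping a list of n observations to a number
    draw:      function taking a random.Random and returning one observation
    """
    rng = random.Random(seed)
    return [statistic([draw(rng) for _ in range(n)]) for _ in range(k)]


def text_histogram(values, bins=12, width=40):
    """Print a crude text histogram of the simulated values."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / step), bins - 1)  # the maximum goes in the last bin
        counts[i] += 1
    for i, c in enumerate(counts):
        print(f"{lo + i * step:7.3f} | {'#' * round(width * c / max(counts))}")


# Statistic: sample mean; population: uniform on [5, 10]; n = 10; k = 10,000
xbars = simulate_sampling_distribution(
    statistic=statistics.fmean,
    draw=lambda rng: rng.uniform(5, 10),
    n=10,
    k=10_000,
)
text_histogram(xbars)
print(f"center {statistics.fmean(xbars):.3f}, spread {statistics.stdev(xbars):.3f}")
```

Swapping in a different `statistic` (for example `statistics.median`) or a different `draw` function changes the experiment without touching the simulation loop.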

Example 6.4 Consider a simulation experiment in which the population distribution is quite skewed.
Figure 6.5 shows the density curve for lifetimes of a certain type of electronic control. This is actually
a lognormal distribution with E[ln(X)] = 3 and V[ln(X)] = 0.16; that is, ln(X) is normal with mean 3
and standard deviation 0.4.

Figure 6.5 Density curve for the simulation experiment of Example 6.4:
a lognormal distribution with E(X) = 21.76 and V(X) = 82.14


Imagine the statistic of interest is the sample mean, \(\bar{X}\). For any given sample size n, we repeat the
following procedure k times:


• Generate values \(x_1, \ldots, x_n\) from a lognormal distribution with the specified parameter values;
equivalently, generate \(y_1, \ldots, y_n\) from a N(3, 0.4) distribution and apply the transformation \(x = e^y\)
to each value.

• Calculate and store the sample mean \(\bar{x}\) of the n x-values.


We performed this simulation experiment at four different sample sizes: n = 5, 10, 20, and 30. The
experiment utilized k = 1000 replications (a very modest value) for each sample size. The resulting
histograms along with a normal probability plot from R for the 1000 \(\bar{x}\) values based on n = 30 are
shown in Figure 6.6.

The first thing to notice about the histograms is that each one is centered approximately at the
mean of the population being sampled, \(\mu = e^{3 + 0.16/2} \approx 21.76\). Had the histograms been based on an
unending sequence of \(\bar{x}\) values, their centers would have been exactly at the population mean.

Second, note the spread of the histograms relative to each other. The smaller the value of n, the
greater the extent to which the sampling distribution spreads out about the mean value. This is why
the histograms for n = 20 and n = 30 are based on narrower class intervals than those for the two
smaller sample sizes. For the larger sample sizes, most of the \(\bar{x}\) values are quite close to \(\mu\). This is the
effect of averaging. When n is small, a single unusual x value can result in an \(\bar{x}\) value far from the
center. With a larger sample size, any unusual x values, when averaged in with the other sample
values, still tend to yield an \(\bar{x}\) value close to \(\mu\). Combining these insights yields an intuitively
appealing result: \(\bar{X}\) based on a large n tends to be closer to \(\mu\) than does \(\bar{X}\) based on a small n.

Third and finally, consider the shapes of the histograms. Recall from Figure 6.5 that the population
from which the samples were drawn is quite skewed. But as the sample size n increases, the
distribution of \(\bar{X}\) appears to become progressively less skewed. In particular, when n = 30 the
distribution of the 1000 \(\bar{x}\) values appears to be approximately normal, a fact validated by the normal
probability plot in Figure 6.6e. We will discover in the next section that this is part of a much broader
phenomenon known as the Central Limit Theorem: as the sample size n increases, the sampling
distribution of \(\bar{X}\) becomes increasingly normal, irrespective of the population distribution from which
values were sampled.

Figure 6.6 Results of the simulation experiment of Example 6.4: (a) \(\bar{X}\) histogram for n = 5;
(b) \(\bar{X}\) histogram for n = 10; (c) \(\bar{X}\) histogram for n = 20; (d) \(\bar{X}\) histogram for n = 30;
(e) normal probability plot for n = 30 (from R)
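The experiment of Example 6.4 can be reproduced in outline as follows. This is a sketch under stated assumptions: it uses Python's standard library rather than R, an arbitrary seed, and a simple moment-based skewness coefficient (a hypothetical helper, not something defined in the text) in place of the histograms and normal probability plot.

```python
import math
import random
import statistics

MU_LOG, SIGMA_LOG = 3.0, 0.4  # mean and sd of ln(X), from Example 6.4
K = 1000                      # replications per sample size, as in the example
rng = random.Random(3)        # arbitrary seed


def xbar_values(n, k=K):
    """k sample means, each from a fresh lognormal sample of size n."""
    return [
        statistics.fmean([rng.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(n)])
        for _ in range(k)
    ]


def skewness(values):
    """Moment-based skewness coefficient (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


pop_mean = math.exp(MU_LOG + SIGMA_LOG**2 / 2)  # lognormal mean, about 21.76

results = {}
for n in (5, 10, 20, 30):
    vals = xbar_values(n)
    results[n] = (statistics.fmean(vals), statistics.stdev(vals), skewness(vals))

print(f"population mean = {pop_mean:.2f}")
for n, (center, spread, skew) in results.items():
    print(f"n={n:2d}: center={center:.2f} spread={spread:.2f} skewness={skew:.2f}")
```

The printed centers should all sit near the population mean, while the spread shrinks as n grows; with only 1000 replications the skewness values are noisy, so they indicate the trend rather than pin it down.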




Example 6.5 The 2017 study described in Example 4.23 determined that the variable X = proximal
grip distance for female surgeons follows a normal distribution with mean 6.58 cm and standard
deviation 0.50 cm. Consider the statistic \(Q_1\) = the sample 25th percentile (equivalently, the lower
quartile). To investigate the sampling distribution of \(Q_1\) we repeated the following procedure
k = 1000 times:


• Generate a sample \(x_1, \ldots, x_n\) from the N(6.58, 0.50) distribution.
• Calculate and store the lower quartile, \(q_1\), of the n resulting x values.


The results of two such simulation experiments (one for n = 5, another for n = 40) are shown in
Figure 6.7. Similar to \(\bar{X}\)’s behavior in the previous example, we see that the sampling distribution of \(Q_1\)
has greater variability for small n than for large n. Both sampling distributions appear to be centered
roughly at 6.5 cm, which is perhaps not surprising: the 25th percentile of the population distribution is

\[ \eta_{.25} = \mu + \Phi^{-1}(.25) \cdot \sigma = 6.83 + (-0.675)(0.50) \approx 6.49 \text{ cm} \]

Figure 6.7 Sample histograms of \(Q_1\) based on 1000 samples, each consisting of n observations: (a) n = 5, (b) n = 40


In fact, even with an infinite set of replications (i.e., the “true” sampling distribution), the mean of
\(Q_1\) is not exactly \(\eta_{.25}\), but that difference decreases as n increases.


Exercises: Section 6.1 (1-10) 


1. A particular brand of dishwasher soap is sold in three sizes: 25, 40, and 65 oz. 20% of all
purchasers select a 25-oz box, 50% select a 40-oz box, and the remaining 30% choose a 65-oz box.
Let \(X_1\) and \(X_2\) denote the package sizes selected by two independently selected purchasers.
a. Determine the sampling distribution of \(\bar{X}\), calculate \(E(\bar{X})\), and compare to \(\mu\).
b. Determine the sampling distribution of the sample variance \(S^2\), calculate \(E(S^2)\),
and compare to \(\sigma^2\).


2. There are two traffic lights on the way to work. Let \(X_1\) be the number of lights that
are red, requiring a stop, and suppose that the distribution of \(X_1\) is as follows:

    x_1        0     1     2
    p(x_1)    .2    .5    .3        \(\mu = 1.1\), \(\sigma^2 = .49\)

Let \(X_2\) be the number of lights that are red on the way home; \(X_2\) is independent of \(X_1\).
Assume that \(X_2\) has the same distribution as \(X_1\), so that \(X_1, X_2\) is a random sample of
size n = 2.
a. Let \(T_o = X_1 + X_2\), and determine the probability distribution of \(T_o\).
b. Calculate \(\mu_{T_o}\). How does it relate to \(\mu\), the population mean?
c. Calculate \(\sigma^2_{T_o}\). How does it relate to \(\sigma^2\), the population variance?

3. It is known that 80% of all Brand A MP3 players work in a satisfactory manner
throughout the warranty period (are “successes”). Suppose that n = 10 players are
randomly selected. Let X = the number of successes in the sample. The statistic X/n is
the sample proportion (fraction) of successes. Obtain the sampling distribution of
this statistic. [Hint: One possible value of X/n is .3, corresponding to X = 3. What is
the probability of this value (what kind of random variable is X)?]


4. A box contains ten sealed envelopes numbered 1, ..., 10. The first five contain no
money, the next three each contain $5, and there is a $10 bill in each of the last two.
A sample of size 3 is selected with replacement (so we have a random sample),
and you get the largest amount in any of the envelopes selected. If \(X_1\), \(X_2\), and \(X_3\) denote
the amounts in the selected envelopes, the statistic of interest is M = the maximum of
\(X_1\), \(X_2\), and \(X_3\).
a. Obtain the probability distribution of this statistic.
b. Describe how you would carry out a simulation experiment to compare the
distributions of M for various sample sizes. How would you guess the distribution
would change as n increases?


5. Let X be the number of packages being mailed by a randomly selected customer at
a shipping facility. Suppose the distribution of X is as follows:

    x        1     2     3     4
    p(x)    .4    .3    .2    .1

a. Consider a random sample of size n = 2 (two customers), and let \(\bar{X}\) be the sample
mean number of packages shipped. Obtain the sampling distribution of \(\bar{X}\).
b. Refer to part (a) and calculate \(P(\bar{X} \le 2.5)\).
c. Again consider a random sample of size n = 2, but now focus on the statistic
R = the sample range (difference between the largest and smallest values
in the sample). Obtain the sampling distribution of R. [Hint: Calculate the
value of R for each outcome and use the probabilities from part (a).]
d. If a random sample of size n = 4 is selected, what is \(P(\bar{X} \le 1.5)\)? [Hint:
You should not have to list all possible outcomes, only those for which
\(\bar{x} \le 1.5\).]


6. A company maintains three offices in a region, each staffed by two employees.
Information concerning yearly salaries (1000s of dollars) is as follows:

    Office        1       1       2       2       3       3
    Employee      1       2       3       4       5       6
    Salary     29.7    33.6    30.2    33.6    25.8    29.7

a. Suppose two of these employees are randomly selected from among the six
(without replacement). Determine the sampling distribution of the sample
mean salary \(\bar{X}\).
b. Suppose one of the three offices is randomly selected. Let \(X_1\) and \(X_2\) denote the
salaries of the two employees. Determine the sampling distribution of \(\bar{X}\).
c. How does \(E(\bar{X})\) from parts (a) and (b) compare to the population mean
salary \(\mu\)?


7. The number of dirt specks on a randomly selected square yard of polyethylene film of
a certain type has a Poisson distribution with a mean value of 2 specks per square
yard. Consider a random sample of n = 5 film specimens, each having area 1 square
yard, and let \(\bar{X}\) be the resulting sample mean number of dirt specks. Obtain the first
21 probabilities in the \(\bar{X}\) sampling distribution. [Hint: What does a moment generating
function argument say about the distribution of \(X_1 + \cdots + X_5\)?]


8. Suppose the amount of liquid dispensed by a machine is uniformly distributed with
lower limit A = 8 oz and upper limit B = 10 oz. Describe how you would carry
out simulation experiments to compare the sampling distribution of the sample iqr for
sample sizes n = 5, 10, 20, and 30.


9. Carry out a simulation experiment using a statistical computer package or other
software to study the sampling distribution of \(\bar{X}\) when the population distribution is
Weibull with \(\alpha = 2\) and \(\beta = 5\), as in Example 6.1. Consider the four sample sizes
n = 5, 10, 20, and 30, and in each case use at least 1000 replications. For which of these
sample sizes does the \(\bar{X}\) sampling distribution appear to be approximately normal?


10. Carry out a simulation experiment using a statistical computer package or other
software to study the sampling distribution of \(\bar{X}\) when the population distribution is
lognormal with E[ln(X)] = 3 and V[ln(X)] = 1. Consider the four sample sizes
n = 10, 20, 30, and 50, and in each case use at least 1000 replications. For which
of these sample sizes does the \(\bar{X}\) sampling distribution appear to be approximately
normal?


6.2 The Distribution of Sample Totals, Means, and Proportions 


Throughout this section, we will be primarily interested in the properties of two particular rvs derived
from random samples: the sample total \(T_o\) and the sample mean \(\bar{X}\):

\[ T_o = X_1 + \cdots + X_n = \sum_{i=1}^{n} X_i, \qquad \bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{T_o}{n} \]

The importance of the sample mean \(\bar{X}\) springs from its use in drawing conclusions about the
population mean \(\mu\). Some of the most frequently used inferential procedures are based on properties of
the sampling distribution of \(\bar{X}\). A preview of these properties appeared in the calculations and
simulation experiments of the previous section, where we noted relationships between \(E(\bar{X})\) and \(\mu\)
and also among \(V(\bar{X})\), \(\sigma^2\), and n.


PROPOSITION Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with mean value \(\mu\) and
standard deviation \(\sigma\). Then

    1. \(E(T_o) = n\mu\)                                    1. \(E(\bar{X}) = \mu\)

    2. \(V(T_o) = n\sigma^2\) and \(\sigma_{T_o} = \sqrt{n}\,\sigma\)            2. \(V(\bar{X}) = \sigma^2/n\) and \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\)

    3. If the \(X_i\)'s are normally distributed,          3. If the \(X_i\)'s are normally distributed,
       then \(T_o\) is also normally distributed.             then \(\bar{X}\) is also normally distributed.

Proof From the main theorem of Section 5.3, the expected value of a sum is the sum of the
individual expected values; moreover, when the variables in the sum are independent, the variance of
the sum is the sum of the individual variances:

\[
\begin{aligned}
E(T_o) &= E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n) = \mu + \cdots + \mu = n\mu \\
V(T_o) &= V(X_1 + \cdots + X_n) = V(X_1) + \cdots + V(X_n) = \sigma^2 + \cdots + \sigma^2 = n\sigma^2 \\
\sigma_{T_o} &= \sqrt{n\sigma^2} = \sqrt{n}\,\sigma
\end{aligned}
\]

The corresponding results for \(\bar{X}\) can be derived by writing \(\bar{X} = \frac{1}{n} T_o\) and using basic rescaling
properties, such as E(cY) = cE(Y). Property 3 is a consequence of the more general result from
Section 5.3 that any linear combination of independent normal rvs is normal.
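For completeness, the rescaling step for \(\bar{X}\) can be written out as follows. This is a short worked derivation, assuming the standard rescaling rule \(V(cY) = c^2 V(Y)\) alongside the rule \(E(cY) = cE(Y)\) quoted in the proof, with \(c = 1/n\) and \(Y = T_o\).

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\tfrac{1}{n} T_o\right) = \tfrac{1}{n}\, E(T_o)
            = \tfrac{1}{n}(n\mu) = \mu \\
V(\bar{X}) &= V\!\left(\tfrac{1}{n} T_o\right) = \tfrac{1}{n^2}\, V(T_o)
            = \tfrac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n} \\
\sigma_{\bar{X}} &= \sqrt{V(\bar{X})} = \frac{\sigma}{\sqrt{n}}
\end{align*}
```

With n = 2 and \(\sigma^2 = 244\) this gives \(V(\bar{X}) = 122\), matching the direct calculation in Example 6.2.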


According to Property 1, the distribution of \(\bar{X}\) is centered precisely at the mean of the population
from which the sample has been selected. If the sample mean is used to compute an estimate
(educated guess) of the population mean \(\mu\), there will be no systematic tendency for the estimate to be
too large or too small.

Property 2 shows that the \(\bar{X}\) distribution becomes more concentrated about \(\mu\) as the sample size
n increases, because its standard deviation decreases. In marked contrast, the distribution of \(T_o\)
becomes more spread out as n increases. Averaging moves probability in toward the middle, whereas
totaling spreads probability out over a wider and wider range of values. The expression \(\sigma/\sqrt{n}\) for the
standard deviation of \(\bar{X}\) is called the standard error of the mean, and it indicates the typical amount
by which a value of \(\bar{X}\) will deviate from the true mean, \(\mu\) (in contrast, \(\sigma\) itself represents the typical
difference between an individual \(X_i\) and \(\mu\)).

When \(\sigma\) is unknown, as is usually the case when \(\mu\) is unknown and we are trying to estimate it, we
may substitute the sample standard deviation, s, of our sample into the standard error formula and say
that an observed value of \(\bar{X}\) will typically differ by about \(s/\sqrt{n}\) from \(\mu\). This is the estimated standard
error formula presented in Sections 3.8 and 4.8.

Finally, Property 3 says that \(\bar{X}\) and \(T_o\) are both normally distributed when the population
distribution is normal. In particular, probabilities such as \(P(a \le \bar{X} \le b)\) and \(P(c \le T_o \le d)\) can be
obtained simply by standardizing, with the appropriate means and standard deviations provided by
Properties 1 and 2. Figure 6.8 illustrates the \(\bar{X}\) part of the proposition.


Figure 6.8 A normal population distribution and \(\bar{X}\) sampling distributions (curves shown: population
distribution; \(\bar{X}\) distribution when n = 4; \(\bar{X}\) distribution when n = 10)


Example 6.6 The amount of time that a patient spends in a certain outpatient surgery center is a
random variable with a mean value of 4.5 h and a standard deviation of 1.4 h. Let \(X_1, \ldots, X_{25}\) be the
times for a random sample of 25 patients. Then the expected total time for the 25 patients is
\(E(T_o) = n\mu = 25(4.5) = 112.5\) h, whereas the expected sample mean amount of time is
\(E(\bar{X}) = \mu = 4.5\) h. The standard deviations of \(T_o\) and \(\bar{X}\) are

\[ \sigma_{T_o} = \sqrt{n}\,\sigma = \sqrt{25}\,(1.4) = 7 \text{ h} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{1.4}{\sqrt{25}} = .28 \text{ h} \]

Suppose further that such patient times follow a normal distribution; i.e., \(X_i \sim N(4.5, 1.4)\). Then the
total time spent by 25 randomly selected patients in this center is also normal: \(T_o \sim N(112.5, 7)\). The
probability their total time exceeds five days (120 h) is

\[ P(T_o > 120) = 1 - P(T_o \le 120) = 1 - \Phi\!\left(\frac{120 - 112.5}{7}\right) \approx 1 - \Phi(1.07) = 1 - .8577 = .1423 \]

This same probability can be reframed in terms of \(\bar{X}\): for 25 patients, a total time of 120 h equates to
an average time of 120/25 = 4.8 h, and since \(\bar{X} \sim N(4.5, .28)\),

\[ P(\bar{X} > 4.8) = 1 - \Phi\!\left(\frac{4.8 - 4.5}{.28}\right) \approx 1 - \Phi(1.07) = 1 - .8577 = .1423 \]
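The calculation in Example 6.6 can be checked with software instead of a normal table. A minimal sketch, assuming Python's `statistics.NormalDist` as the tool (the text does not prescribe one):

```python
import math
from statistics import NormalDist

mu, sigma, n = 4.5, 1.4, 25  # population mean, population sd, sample size

# Sampling distributions of the total and the mean under a normal population
total = NormalDist(n * mu, math.sqrt(n) * sigma)  # T_o ~ N(112.5, 7)
mean = NormalDist(mu, sigma / math.sqrt(n))       # Xbar ~ N(4.5, .28)

p_total = 1 - total.cdf(120)    # P(T_o > 120)
p_mean = 1 - mean.cdf(120 / n)  # P(Xbar > 4.8), the same event

print(f"sd of total = {total.stdev:.2f}, sd of mean = {mean.stdev:.2f}")
print(f"P(T_o > 120) = {p_total:.4f}, P(Xbar > 4.8) = {p_mean:.4f}")
```

The software value differs slightly in the fourth decimal place from a table lookup because the table calculation rounds the z value to 1.07.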


Example 6.7 Resistors used in electronics manufacturing are labeled with a “nominal” resistance as
well as a percentage tolerance. For example, a 330-Ω resistor with a 5% tolerance is anticipated to
have an actual resistance between 313.5 and 346.5 Ω. Consider five such resistors, randomly selected
from the population of all resistors with those specifications, and model the resistance of each by a
uniform distribution on [313.5, 346.5]. If these are connected in series, the resistance R of the system
is given by \(R = X_1 + \cdots + X_5\), where the \(X_i\)'s are the iid uniform resistances.

A random variable uniformly distributed on [A, B] has mean (A + B)/2 and standard deviation
\((B - A)/\sqrt{12}\). For our uniform model, the mean resistance is \(E(X_i) = (313.5 + 346.5)/2 = 330\) Ω, the
nominal resistance, with a standard deviation of \((346.5 - 313.5)/\sqrt{12} = 9.526\) Ω. The system’s
resistance has mean and standard deviation

\[ E(R) = n\mu = 5(330) = 1650\ \Omega, \qquad \sigma_R = \sqrt{n}\,\sigma = \sqrt{5}\,(9.526) = 21.3\ \Omega \]

But what is the probability distribution of R? Is R also uniformly distributed? Determining the exact pdf
of R is difficult (it requires four convolutions). And the mgf of R, while easy to obtain, is not
recognizable as coming from any particular family of known distributions. Instead, we resort to a simulation
of R, the results of which appear in Figure 6.9. For 10,000 iterations, five independent uniform variates
on [313.5, 346.5] were created and summed; see Section 4.8 for information on simulating values from
a uniform distribution. The histogram in Figure 6.9 clearly indicates that R is not uniform; in fact, if
anything, R appears (from the simulation, anyway) to be approximately normally distributed!

Figure 6.9 Simulated distribution of the random variable R in Example 6.7
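The simulation described in Example 6.7 takes only a few lines. This is a sketch assuming Python's `random.uniform` as the uniform generator and an arbitrary seed; in place of a histogram it reports the share of simulated R values within one and two standard deviations of the mean, which for a normal distribution would be about .68 and .95.

```python
import random
import statistics

A, B = 313.5, 346.5     # resistance range for a 330-ohm, 5% tolerance resistor
N_RESISTORS = 5         # resistors connected in series
ITERATIONS = 10_000     # number of simulated systems, as in the example
rng = random.Random(4)  # arbitrary seed

# Each simulated R is the sum of five independent uniform resistances
r_values = [
    sum(rng.uniform(A, B) for _ in range(N_RESISTORS)) for _ in range(ITERATIONS)
]

r_mean = statistics.fmean(r_values)
r_sd = statistics.stdev(r_values)


def share_within(c):
    """Proportion of simulated R values within c standard deviations of the mean."""
    return sum(abs(r - r_mean) <= c * r_sd for r in r_values) / ITERATIONS


within1, within2 = share_within(1), share_within(2)
print(f"mean = {r_mean:.1f} ohms (theory 1650), sd = {r_sd:.2f} ohms (theory 21.3)")
print(f"within 1 sd: {within1:.3f}, within 2 sd: {within2:.3f}")
```

A uniform distribution would put only about 58% of its probability within one standard deviation of its mean, so shares near .68 and .95 are consistent with the bell shape seen in Figure 6.9.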


The Central Limit Theorem 
When iid \(X_i\)'s are normally distributed, so are \(T_o\) and \(\bar{X}\) for every sample size n. The simulation results
from Example 6.7 suggest that even when the population distribution is not normal, summing (or
averaging) produces a distribution more bell-shaped than the one being sampled. Upon reflection, this
is quite intuitive: in order for R to be near 5(346.5) = 1732.5, its theoretical maximum, all five
randomly selected resistors would have to exert resistances at the high end of their common range
(i.e., every \(X_i\) would have to be near 346.5). Thus, R-values near 1732.5 are unlikely, and the same
applies to R’s theoretical minimum of 5(313.5) = 1567.5. On the other hand, there are many ways for
R to be near the mean value of 1650: all five resistances in the middle, two low and one middle and
two high, and so on. Thus, R is more likely to be “centrally” located than out at the extremes. (This is
analogous to the well-known fact that rolling a pair of dice is far more likely to result in a sum of 7
than 2 or 12, because there are more ways to obtain 7.)

This general pattern of behavior for sample totals and sample means is formalized by the most 
important theorem of probability, the Central Limit Theorem (CLT). 


CENTRAL LIMIT THEOREM (CLT) Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with
mean \(\mu\) and standard deviation \(\sigma\). Then, in the limit as \(n \to \infty\), the standardized versions of
\(\bar{X}\) and \(T_o\) have the standard normal distribution. That is,

\[ \lim_{n \to \infty} P\!\left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z \right) = P(Z \le z) = \Phi(z) \]

and

\[ \lim_{n \to \infty} P\!\left( \frac{T_o - n\mu}{\sqrt{n}\,\sigma} \le z \right) = P(Z \le z) = \Phi(z) \]

where Z is a standard normal rv. It is customary to say that \(\bar{X}\) and \(T_o\) are
asymptotically normal, and that their standardized versions converge in
distribution to Z. Thus when n is sufficiently large, \(\bar{X}\) has approximately a
normal distribution with \(\mu_{\bar{X}} = \mu\) and \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\). Equivalently, for large n the
sum \(T_o\) has approximately a normal distribution with \(\mu_{T_o} = n\mu\) and \(\sigma_{T_o} = \sqrt{n}\,\sigma\).


Figure 6.10 illustrates the Central Limit Theorem. A partial proof of the CLT appears in the appendix
to this chapter. It is shown that, if the moment generating function exists, then the mgf of the
standardized $\bar{X}$ (and of $T_o$) approaches the standard normal mgf. With the aid of an advanced
probability theorem, this implies the CLT statement about convergence of probabilities.


[Figure: the population distribution together with $\bar{X}$ distributions; the one for large n is approximately normal]

Figure 6.10 The Central Limit Theorem for $\bar{X}$ illustrated


A practical difficulty in applying the CLT is in knowing when n is “sufficiently large.” The 
problem is that the accuracy of the approximation for a particular n depends on the shape of the 
original underlying distribution being sampled. If the underlying distribution is symmetric and there 
is not much probability far out in the tails, then the approximation will be good even for a small n, 
whereas if it is highly skewed or has “heavy” tails, then a large n will be required. For example, if the 
distribution is uniform on an interval, then it is symmetric with no probability in the tails, and the 
normal approximation is very good for n as small as 10 (in Example 6.7, even for n = 5, the
distribution of the sample total appeared rather bell-shaped). However, at the other extreme, a 
distribution can have such fat tails that its mean fails to exist and the Central Limit Theorem does not 
apply, so no n is big enough. A popular, although frequently somewhat conservative, convention is 
that the Central Limit Theorem may be safely applied when n > 30. Of course, there are exceptions, 
but this rule applies to most distributions of real data. 
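The claim that a uniform population gives a good normal approximation even for small n can be checked with a short simulation. The sketch below is illustrative only: it assumes resistances uniform on [313.5, 346.5] (the range quoted earlier in this section), and the function names, seed, and number of replications are arbitrary choices, not part of the text.

```python
import random
import statistics

def simulate_totals(n, reps, low=313.5, high=346.5, seed=1):
    """Return `reps` simulated sample totals of n iid uniform(low, high) draws."""
    rng = random.Random(seed)
    return [sum(rng.uniform(low, high) for _ in range(n)) for _ in range(reps)]

def fraction_within(values, center, sd, k):
    """Fraction of values lying within k standard deviations of center."""
    return sum(abs(v - center) <= k * sd for v in values) / len(values)

n = 5
totals = simulate_totals(n, reps=20000)
mu = n * (313.5 + 346.5) / 2                    # n * mu = 1650
sigma = (n * (346.5 - 313.5) ** 2 / 12) ** 0.5  # sqrt(n) * sigma for a uniform population

normal = statistics.NormalDist()
for k in (1, 2):
    observed = fraction_within(totals, mu, sigma, k)
    expected = normal.cdf(k) - normal.cdf(-k)
    print(f"within {k} SD: simulated {observed:.3f}, normal {expected:.3f}")
```

Swapping the uniform draws for a strongly skewed population would be expected to widen the gap between the two columns at n = 5.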


Example 6.8 When a batch of a certain chemical product is prepared, the amount of a particular
impurity in the batch is a random variable with mean value 4.0 g and standard deviation 1.5 g. If 50
batches are independently prepared, what is the (approximate) probability that the sample average
amount of impurity $\bar{X}$ is between 3.5 and 3.8 g? According to the convention mentioned above,
n = 50 is large enough for the CLT to be applicable. The sample mean $\bar{X}$ then has approximately a
normal distribution with mean value $\mu_{\bar{X}} = 4.0$ and $\sigma_{\bar{X}} = 1.5/\sqrt{50} = .2121$, so

$$P(3.5 \le \bar{X} \le 3.8) \approx P\left(\frac{3.5 - 4.0}{.2121} \le Z \le \frac{3.8 - 4.0}{.2121}\right) = \Phi(-.94) - \Phi(-2.36) = .1645$$ ■


Example 6.9 Suppose the number of times a randomly selected customer of a large bank uses the 
bank’s ATM during a particular period is a random variable with a mean value of 3.2 and a standard 
deviation of 2.4. Among 100 randomly selected customers, how likely is it that the sample mean 
number of times the bank's ATM is used exceeds 4? Let $X_i$ denote the number of times the ith
customer in the sample uses the bank's ATM. Notice that $X_i$ is a discrete rv, but the CLT is not limited
to continuous random variables. Also, although the fact that the standard deviation of this nonnegative
variable is quite large relative to the mean value suggests that its distribution is positively
skewed, the large sample size implies that $\bar{X}$ does have approximately a normal distribution. Using
$\mu_{\bar{X}} = 3.2$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 2.4/\sqrt{100} = .24$,

$$P(\bar{X} > 4) \approx P\left(Z > \frac{4 - 3.2}{.24}\right) = 1 - \Phi(3.33) = .0004$$ ■
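The calculations in Examples 6.8 and 6.9 can be reproduced in a few lines. This is a minimal sketch using Python's standard library rather than a normal table; the helper name `sample_mean_dist` is made up for illustration. Because the z values are not rounded to two decimals, the results differ slightly from the table-based answers.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_dist(mu, sigma, n):
    """Approximate (CLT) distribution of the sample mean of n iid observations."""
    return NormalDist(mu, sigma / sqrt(n))

# Example 6.8: impurity amount, mu = 4.0 g, sigma = 1.5 g, n = 50
impurity = sample_mean_dist(4.0, 1.5, 50)
p_68 = impurity.cdf(3.8) - impurity.cdf(3.5)

# Example 6.9: ATM uses, mu = 3.2, sigma = 2.4, n = 100
atm = sample_mean_dist(3.2, 2.4, 100)
p_69 = 1 - atm.cdf(4)

print(f"P(3.5 <= Xbar <= 3.8) ~ {p_68:.4f}")
print(f"P(Xbar > 4)           ~ {p_69:.4f}")
```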


Example 6.10 Consider the distribution shown in Figure 6.11 for the amount purchased (rounded to
the nearest dollar) by a randomly selected customer at a particular gas station. (A similar distribution
for purchases in Britain (in £) appeared in the article "Data Mining for Fun and Profit," Stat. Sci.
2000: 111-131; there were big spikes at the values 10, 15, 20, 25, and 30.) The distribution is
obviously quite nonnormal.

Figure 6.11 Probability distribution of X = amount of gasoline purchased ($) in Example 6.10

We asked R to select 1000 different samples, each consisting of n = 15 observations, and calculate
the value of the sample mean $\bar{X}$ for each one. Figure 6.12 is a histogram of the resulting 1000 values;
this is the approximate sampling distribution of $\bar{X}$ under the specified circumstances. This distribution
is clearly approximately normal even though the sample size is not all that large. As further evidence
for normality, Figure 6.13 shows a normal probability plot of the 1000 $\bar{x}$ values; the linear pattern is
very prominent. It is typically not nonnormality in the central part of the population distribution that
causes the CLT to fail, but instead very substantial skewness or extremely heavy tails.

Figure 6.12 Approximate sampling distribution of the sample mean amount purchased when n = 15 and the
population distribution is as shown in Figure 6.11

Figure 6.13 Normal probability plot of the 1000 $\bar{x}$ values based on samples of size n = 15 ■


The CLT can also be generalized so it applies to nonidentically distributed independent random
variables and certain linear combinations. Roughly speaking, if n is large and no individual term is
likely to contribute too much to the overall value, then asymptotic normality prevails (see Exercise
68). It can also be generalized to sums of variables which are not independent, provided the extent of
dependence between most pairs of variables is not too strong.

Other Applications of the Central Limit Theorem

The CLT can be used to justify the normal approximation to the binomial distribution discussed
in Chapter 4. Recall that a binomial variable X is the number of successes in a binomial
experiment consisting of n independent success/failure trials with p = P(success) for any
particular trial. Define new rvs $X_1, X_2, \ldots, X_n$ by

$$X_i = \begin{cases} 1 & \text{if the $i$th trial results in a success} \\ 0 & \text{if the $i$th trial results in a failure} \end{cases} \qquad (i = 1, \ldots, n)$$

Because the trials are independent and P(success) is constant from trial to trial, the $X_i$'s are iid (a
random sample from a Bernoulli distribution). When the $X_i$'s are summed, a 1 is added for every
success that occurs and a 0 for every failure, so $X = X_1 + \cdots + X_n$, their total. The sample mean
of the $X_i$'s is $\bar{X} = X/n$, the sample proportion of successes, which in previous discussions we
have denoted $\hat{P}$. The CLT then implies that if n is sufficiently large, both X and $\hat{P}$ are
approximately normal. We summarize properties of the $\hat{P}$ distribution in the
following corollary; Statements 1 and 2 were derived in Section 3.5.

COROLLARY Consider an event A in the sample space of some experiment with p = P(A). Let
X = the number of times A occurs when the experiment is repeated n independent
times, and define

$$\hat{P} = \hat{P}(A) = \frac{X}{n}$$

Then

1. $E(\hat{P}) = p$
2. $V(\hat{P}) = \dfrac{p(1-p)}{n}$ and $\sigma_{\hat{P}} = \sqrt{\dfrac{p(1-p)}{n}}$
3. As n increases, the distribution of $\hat{P}$ approaches a normal distribution.


In practice, Property 3 is taken to say that $\hat{P}$ is approximately normal, provided that

$$np \ge 10 \quad \text{and} \quad n(1-p) \ge 10.$$

The necessary sample size for this approximation depends on the value of p: When p is close to .5, the
distribution of each Bernoulli $X_i$ is reasonably symmetric (see Figure 6.14), whereas the distribution is
quite skewed when p is near 0 or 1. Using the approximation only if both $np \ge 10$ and $n(1-p) \ge 10$
ensures that n is large enough to overcome any skewness in the underlying Bernoulli distribution.


[Figure: two Bernoulli pmfs on the values 0 and 1, panels (a) and (b)]

Figure 6.14 Two Bernoulli distributions: (a) p = .4 (reasonably symmetric); (b) p = .1 (very skewed)

Example 6.11 A computer simulation in the style of Section 2.6 is used to determine the probability
that a complex system of components operates properly throughout the warranty period. Unknown to
the investigator, the true probability is P(A) = .18. If 10,000 simulations of the underlying process are
run, what is the chance the estimated probability $\hat{P}(A)$ will be within .01 of the true probability P(A)?

Apply the preceding corollary, with n = 10,000 and p = P(A) = .18. The expected value of $\hat{P}(A)$ is
p = .18, and the standard deviation is $\sigma_{\hat{P}} = \sqrt{.18(.82)/10{,}000} = .00384$. Since $np = 1800 \ge 10$
and $n(1-p) = 8200 \ge 10$, a normal distribution can safely be used to approximate the distribution
of $\hat{P}(A)$. This sample proportion is within .01 of the true probability if and only if $.17 < \hat{P}(A) < .19$, so
the desired likelihood is approximately

$$P(.17 < \hat{P} < .19) \approx P\left(\frac{.17 - .18}{.00384} < Z < \frac{.19 - .18}{.00384}\right) = \Phi(2.60) - \Phi(-2.60) = .9906$$ ■
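As a rough check on Example 6.11, the normal approximation can be set beside the exact binomial probability. The sketch below is not part of the text's method: it evaluates the binomial pmf on the log scale (an implementation choice made here to avoid underflow at n = 10,000), and the helper name `binom_pmf` is hypothetical. The exact sum includes the endpoints 1700 and 1900, so a small discrepancy from the continuous approximation is expected.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Binomial pmf evaluated on the log scale to avoid underflow for large n."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

n, p = 10_000, 0.18
sd_phat = sqrt(p * (1 - p) / n)              # about .00384
phat = NormalDist(p, sd_phat)
approx = phat.cdf(0.19) - phat.cdf(0.17)     # CLT approximation

# Exact: .17 <= X/n <= .19 is the same event as 1700 <= X <= 1900.
exact = sum(binom_pmf(k, n, p) for k in range(1700, 1901))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```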


The normal distribution serves as a reasonable approximation to the binomial pmf when n is large
because the binomial distribution is additive; i.e., a binomial rv can be expressed as the sum of other,
iid rvs. Other additive distributions include the Poisson, negative binomial, gamma, and (of course)
normal distributions; some of these were discussed at the end of Section 5.3. In particular, the CLT
justifies normal approximations to the following distributions:

• Poisson, when $\mu$ is large
• Negative binomial, when r is large
• Gamma, when $\alpha$ is large
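To see how "large" matters for one of these cases, the sketch below compares the exact Poisson probability $P(X \le \mu)$ with a normal approximation having mean $\mu$ and standard deviation $\sqrt{\mu}$. The half-unit continuity correction and the choice of evaluation point are assumptions made for this illustration; the function names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def poisson_cdf(x, mu):
    """Exact P(X <= x) for X ~ Poisson(mu), summing the pmf on the log scale."""
    return sum(exp(-mu + k * log(mu) - lgamma(k + 1)) for k in range(int(x) + 1))

def normal_approx_cdf(x, mu):
    """Normal approximation (mean mu, SD sqrt(mu)) with a continuity correction."""
    return NormalDist(mu, sqrt(mu)).cdf(x + 0.5)

errors = {}
for mu in (2, 10, 50, 200):
    exact, approx = poisson_cdf(mu, mu), normal_approx_cdf(mu, mu)
    errors[mu] = abs(exact - approx)
    print(f"mu={mu:>3}  exact={exact:.4f}  normal={approx:.4f}  error={errors[mu]:.4f}")
```

The error at this evaluation point shrinks as $\mu$ grows, consistent with viewing a Poisson rv with large $\mu$ as a sum of many iid Poisson rvs.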


As a final application of the CLT, first recall from Section 4.5 that X has a lognormal distribution if
ln(X) has a normal distribution.

PROPOSITION Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution for which only positive
values are possible [$P(X_i > 0) = 1$]. Then if n is sufficiently large, the product
$Y = X_1 X_2 \cdots X_n$ has approximately a lognormal distribution; that is,
ln(Y) has approximately a normal distribution.

To verify this, note that

$$\ln(Y) = \ln(X_1) + \ln(X_2) + \cdots + \ln(X_n)$$

Since ln(Y) is a sum of independent and identically distributed rvs [the $\ln(X_i)$'s], it is approximately
normal when n is large, so Y itself has approximately a lognormal distribution. As an example of the
applicability of this result, it has been argued that the damage process in plastic flow and crack
propagation is a multiplicative process, so that variables such as percentage elongation and rupture
strength have approximately lognormal distributions.
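A small simulation illustrates the proposition. The factor distribution used here, uniform on (0.5, 1.5), is purely hypothetical, as are the seed and sample sizes. The sketch compares the sample skewness of the product Y with that of ln(Y).

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(7)
n, reps = 30, 20000
# Hypothetical positive factors: each X_i is uniform on (0.5, 1.5).
products = [math.prod(rng.uniform(0.5, 1.5) for _ in range(n)) for _ in range(reps)]
logs = [math.log(y) for y in products]

print(f"skewness of Y:     {skewness(products):.2f}")  # strongly right-skewed
print(f"skewness of ln(Y): {skewness(logs):.2f}")      # near 0, as for a normal rv
```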


The Law of Large Numbers 

In the simulation sections of Chapters 2-4, we described how a sample proportion $\hat{P}$ could estimate a
true probability p, and a sample mean $\bar{X}$ served to approximate a theoretical expected value $\mu$.
Moreover, in both cases the precision of the estimation improves as the number of simulation runs, n,
increases. We would like to be able to say that our estimates "converge" to the correct answers in
some sense. Such a convergence statement is justified by another important theoretical result, called
the Law of Large Numbers.

To begin, recall the first proposition in this section: If $X_1, X_2, \ldots, X_n$ is a random sample from a
distribution with mean $\mu$ and standard deviation $\sigma$, then $E(\bar{X}) = \mu$ and $V(\bar{X}) = \sigma^2/n$. As n increases,
the expected value of $\bar{X}$ remains at $\mu$ but the variance approaches zero; that is,
$E[(\bar{X} - \mu)^2] = V(\bar{X}) = \sigma^2/n \to 0$. We say that $\bar{X}$ converges in mean square to $\mu$ because the mean of
the squared difference between $\bar{X}$ and $\mu$ goes to zero. This is one form of the Law of Large Numbers.

Another form of convergence states that as the sample size n increases, $\bar{X}$ is increasingly unlikely
to differ by any set amount from $\mu$. More precisely, let $\varepsilon$ be a positive number close to 0, such as .01
or .001, and consider $P(|\bar{X} - \mu| \ge \varepsilon)$, the probability that $\bar{X}$ differs from $\mu$ by at least $\varepsilon$ (at least .01, at
least .001, etc.). We will prove shortly that, no matter how small the value of $\varepsilon$, this probability will
approach zero as $n \to \infty$. Because of this, statisticians say that $\bar{X}$ converges to $\mu$ in probability.

The two forms of the Law of Large Numbers are summarized in the following theorem.


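LAW OF LARGE NUMBERS If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with mean $\mu$,
then $\bar{X}$ converges to $\mu$

1. in mean square: $E[(\bar{X} - \mu)^2] \to 0$ as $n \to \infty$
2. in probability: $P(|\bar{X} - \mu| \ge \varepsilon) \to 0$ as $n \to \infty$ for any $\varepsilon > 0$.

Proof The proof of Statement 1 appears a few paragraphs above. For Statement 2, recall
Chebyshev's inequality (Exercises 45 and 163 in Chapter 3), which states that for any rv Y,
$P(|Y - \mu_Y| \ge k\sigma_Y) \le 1/k^2$ for any $k \ge 1$ (i.e., the probability that Y is at least k standard
deviations away from its mean is at most $1/k^2$). Let $Y = \bar{X}$, so $\mu_Y = E(\bar{X}) = \mu$ and
$\sigma_Y = \sigma_{\bar{X}} = \sigma/\sqrt{n}$. Now, for any $\varepsilon > 0$, determine the value of k such that $\varepsilon = k\sigma_Y = k\sigma/\sqrt{n}$;
solving for k yields $k = \varepsilon\sqrt{n}/\sigma$, which for sufficiently large n will exceed 1. Apply Chebyshev's
inequality:

$$P(|Y - \mu_Y| \ge k\sigma_Y) \le \frac{1}{k^2} \;\Rightarrow\; P\left(|\bar{X} - \mu| \ge \frac{\varepsilon\sqrt{n}}{\sigma} \cdot \frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{(\varepsilon\sqrt{n}/\sigma)^2} \;\Rightarrow\; P(|\bar{X} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2 n} \to 0 \text{ as } n \to \infty$$

That is, $P(|\bar{X} - \mu| \ge \varepsilon) \to 0$ as $n \to \infty$ for any $\varepsilon > 0$. ■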
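The Chebyshev bound $\sigma^2/(\varepsilon^2 n)$ used in the proof can be compared with simulated tail frequencies. The sketch below assumes a uniform(0, 1) population (so $\mu = .5$ and $\sigma^2 = 1/12$); the population, $\varepsilon$, seed, and replication count are all illustrative choices. The bound is capped at 1 because a probability cannot exceed 1.

```python
import random
import statistics

def tail_frequency(n, eps, reps, rng):
    """Estimate P(|Xbar - mu| >= eps) for means of n uniform(0, 1) draws (mu = 0.5)."""
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.random() for _ in range(n))
        hits += abs(xbar - 0.5) >= eps
    return hits / reps

rng = random.Random(42)
eps = 0.05
sigma2 = 1 / 12                                   # variance of one uniform(0, 1) draw
for n in (10, 100, 1000):
    bound = min(1.0, sigma2 / (eps ** 2 * n))     # Chebyshev: sigma^2 / (eps^2 n)
    freq = tail_frequency(n, eps, reps=2000, rng=rng)
    print(f"n={n:>4}  simulated={freq:.3f}  Chebyshev bound={bound:.3f}")
```

The simulated frequencies fall well below the bound, which is typical: Chebyshev's inequality is enough to prove convergence but is usually far from sharp.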


Convergence of $\bar{X}$ to $\mu$ in probability actually holds even if the variance $\sigma^2$ does not exist (a heavy-tailed
distribution) as long as $\mu$ is finite. But then Chebyshev's inequality cannot be used, and the
proof is much more complicated.

In statistical language, the Law of Large Numbers states that $\bar{X}$ is a consistent estimator of $\mu$. Other
statistics are also consistent estimators of the corresponding parameters. For example, it can be shown
that the sample proportion $\hat{P}$ is a consistent estimator of the population proportion p (Exercise 24),
and the sample variance $S^2 = \sum (X_i - \bar{X})^2/(n-1)$ is a consistent estimator of the population
variance $\sigma^2$.

Exercises: Section 6.2 (11-27)


11. The inside diameter of a randomly selected piston ring is a random variable with mean
    value 12 cm and standard deviation .04 cm.

    a. If $\bar{X}$ is the sample mean diameter for a random sample of n = 16 rings, where
       is the sampling distribution of $\bar{X}$ centered, and what is the standard deviation
       of the $\bar{X}$ distribution?
    b. Answer the questions posed in part (a) for a sample size of n = 64 rings.
    c. For which of the two random samples, the one of part (a) or the one of part (b),
       is $\bar{X}$ more likely to be within .01 cm of 12 cm? Explain your reasoning.

12. Refer to the previous exercise. Suppose the distribution of diameter is normal.

    a. Calculate $P(11.99 \le \bar{X} \le 12.01)$ when n = 16.
    b. How likely is it that the sample mean diameter exceeds 12.01 when n = 25?

13. The National Health Statistics Reports dated Oct. 22, 2008 stated that for a sample size of
    277 18-year-old American males, the sample mean waist circumference was 86.3 cm.
    A somewhat complicated method was used to estimate various population percentiles,
    resulting in the following values:

    Percentile   5th    10th   25th   50th   75th   90th    95th
    Value        69.6   70.9   75.2   81.3   95.4   107.1   116.4

    a. Is it plausible that the waist size distribution is at least approximately normal?
       Explain your reasoning. If your answer is no, conjecture the shape of the
       population distribution.
    b. Suppose that the population mean waist size is 85 cm and that the population
       standard deviation is 15 cm. How likely is it that a random sample of 277
       individuals will result in a sample mean waist size of at least 86.3 cm?
    c. Referring back to (b), suppose now that the population mean waist size is 82 cm
       (closer to the median than the mean). Now what is the (approximate) probability
       that the sample mean will be at least 86.3? In light of this calculation,
       do you think that 82 is a reasonable value for $\mu$?

14. There are 40 students in an elementary statistics class. On the basis of years of
    experience, the instructor knows that the time needed to grade a randomly chosen
    first examination paper is a random variable with an expected value of 6 min and a
    standard deviation of 6 min.

    a. If grading times are independent and the instructor begins grading at 6:50 p.m.
       and grades continuously, what is the (approximate) probability that he is
       through grading before the 11:00 p.m. TV news begins?
    b. If the sports report begins at 11:10, what is the probability that he misses part
       of the report if he waits until grading is done before turning on the TV?

15. The tip percentage at a restaurant has a mean value of 18% and a standard deviation of 6%.

    a. What is the approximate probability that the sample mean tip percentage for a
       random sample of 40 bills is between 16 and 19%?
    b. If the sample size had been 15 rather than 40, could the probability requested
       in part (a) be calculated from the given information?

16. The time taken by a randomly selected applicant for a mortgage to fill out a certain
    form has a normal distribution with mean value 10 min and standard deviation
    2 min. If five individuals fill out a form on one day and six on another, what is the
    probability that the sample average amount of time taken on each day is at most
    11 min?

17. The lifetime of a type of battery is normally distributed with mean value 10 h and
    standard deviation 1 h. There are four batteries in a package. What lifetime value
    is such that the total lifetime of all batteries in a package exceeds that value for
    only 5% of all packages?

18. Let X represent the amount of gasoline (gallons) purchased by a randomly selected
    customer at a gas station. Suppose that the mean value and standard deviation of X are
    11.5 and 4.0, respectively.

    a. In a sample of 50 randomly selected customers, what is the approximate
       probability that the sample mean amount purchased is at least 12 gallons?
    b. In a sample of 50 randomly selected customers, what is the approximate
       probability that the total amount of gasoline purchased is at most 600 gallons?
    c. What is the approximate value of the 95th percentile for the total amount
       purchased by 50 randomly selected customers?

19. Suppose that the fracture angle under pure compression of a randomly selected
    specimen of fiber reinforced polymer-matrix composite material is normally distributed
    with mean value 53 and standard deviation 1 (suggested in the article "Stochastic
    Failure Modelling of Unidirectional Composite Ply Failure," Reliability Engr. Syst.
    Safety 2012: 1-9; this type of material is used extensively in the aerospace industry).

    a. If a random sample of 4 specimens is selected, what is the probability that the
       sample mean fracture angle is at most 54? Between 53 and 54?
    b. How many such specimens would be required to ensure that the first
       probability in (a) is at least .999?

20. The first assignment in a statistical computing class involves running a short program.
    If past experience indicates that 40% of all students will make no programming
    errors, compute the (approximate) probability that in a class of 50 students

    a. At least 25 will make no errors. [Hint: Normal approximation to the binomial.]
    b. Between 15 and 25 (inclusive) will make no errors.

21. The number of parking tickets issued in a certain city on any given weekday has a
    Poisson distribution with parameter $\mu = 50$. What is the approximate probability that

    a. Between 35 and 70 tickets are given out on a particular day? [Hint: When $\mu$ is
       large, a Poisson rv has approximately a normal distribution.]
    b. The total number of tickets given out during a 5-day week is between 225 and 275?
    c. Use software to obtain the exact probabilities in (a) and (b), and compare to
       the approximations.

22. Suppose the distribution of the time X (in hours) spent by students at a certain
    university on a particular project is gamma with parameters $\alpha = 50$ and $\beta = 2$.
    Because $\alpha$ is large, it can be shown that X has approximately a normal distribution.
    Use this fact to compute the probability that a randomly selected student spends at most
    125 h on the project.

23. The Central Limit Theorem says that $\bar{X}$ is approximately normal if the sample size is
    large. More specifically, the theorem states that the standardized $\bar{X}$ has a limiting
    standard normal distribution. That is, the rv $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a distribution
    approaching the standard normal. Can you reconcile this with the Law of Large Numbers?

24. Assume a sequence of independent trials, each with probability p of success. Use the
    Law of Large Numbers to show that the proportion of successes approaches p as
    the number of trials becomes large.

25. Let $Y_n$ be the largest order statistic in a sample of size n from the uniform distribution on
    $[0, \theta]$. Show that $Y_n$ converges in probability to $\theta$, that is, that $P(|Y_n - \theta| \ge \varepsilon) \to 0$ as n
    approaches $\infty$. [Hint: The pdf of the largest order statistic appears in Section 5.7, so the
    relevant probability can be obtained by integration (Chebyshev's inequality is not
    relevant here).]

26. A friend commutes by bus to and from work 6 days/week. Suppose that waiting
    time is uniformly distributed between 0 and 10 min, and that waiting times going and
    returning on various days are independent of each other. What is the approximate
    probability that total waiting time for an entire week is at most 75 min?

27. It can be shown that if $Y_n$ converges in probability to a constant $\tau$, then $h(Y_n)$
    converges to $h(\tau)$ for any function $h(\cdot)$ that is continuous at $\tau$. Use this to obtain a
    consistent estimator for the rate parameter $\lambda$ of an exponential distribution. [Hint: How
    does $\mu$ for an exponential distribution relate to the exponential parameter $\lambda$?]


6.3 The $\chi^2$, t, and F Distributions


The previous section explored the sampling distribution of the sample mean, $\bar{X}$, with particular
attention to the special case when our sample $X_1, \ldots, X_n$ is drawn from a normally distributed
population. In this section, we introduce three distributions closely related to the normal: the
chi-squared ($\chi^2$), t, and F distributions. These distributions will then be used in the next section to
describe the sampling variability of several statistics on which important inferential procedures are
based.


The Chi-Squared Distribution 


DEFINITION For a positive integer $\nu$, let $Z_1, \ldots, Z_\nu$ be iid standard normal random variables.
Then the chi-squared distribution with $\nu$ degrees of freedom (df) is defined to
be the distribution of the rv

$$X = Z_1^2 + \cdots + Z_\nu^2$$

This will sometimes be denoted by $X \sim \chi^2_\nu$.


Our first goal is to determine the pdf of this distribution. We start with the $\nu = 1$ case, where we may
write $X = Z_1^2$. As in previous chapters, let $\Phi(z)$ and $\phi(z)$ denote the cdf and pdf, respectively, of the
standard normal distribution. Then the cdf of X, for x > 0, is given by

$$\begin{aligned} F(x) &= P(X \le x) = P(Z_1^2 \le x) = P(-\sqrt{x} \le Z_1 \le \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x}) \\ &= \Phi(\sqrt{x}) - [1 - \Phi(\sqrt{x})] = 2\Phi(\sqrt{x}) - 1 \end{aligned}$$

Above, we've used the symmetry property $\Phi(-z) = 1 - \Phi(z)$ of the standard normal distribution.
Differentiate to obtain the pdf for x > 0:

$$f(x) = F'(x) = 2\Phi'(\sqrt{x}) \cdot \frac{1}{2\sqrt{x}} = 2 \cdot \frac{1}{\sqrt{2\pi}} e^{-(\sqrt{x})^2/2} \cdot \frac{1}{2\sqrt{x}} = \frac{1}{2^{1/2}\sqrt{\pi}}\, x^{-1/2} e^{-x/2} \qquad (6.6)$$

We have established the $\chi^2_1$ pdf. But this expression looks familiar: comparing (6.6) to the gamma pdf
in Expression (4.6), and recalling that $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$, we find that the $\chi^2_1$ distribution is exactly the same
as the gamma distribution with parameters $\alpha = 1/2$ and $\beta = 2$!

To generalize to any number of degrees of freedom, recall that the moment generating function of
the gamma distribution is $M(t) = (1 - \beta t)^{-\alpha}$. So, the mgf of a $\chi^2_1$ rv (that is, the mgf of $Z^2$ when
$Z \sim N(0, 1)$) is $M(t) = (1 - 2t)^{-1/2}$. Using the definition of the chi-squared distribution and
properties of mgfs, we find that for $X \sim \chi^2_\nu$,

$$M_X(t) = M_{Z_1^2}(t) \cdots M_{Z_\nu^2}(t) = (1 - 2t)^{-1/2} \cdots (1 - 2t)^{-1/2} = (1 - 2t)^{-\nu/2}$$

which we recognize as the mgf of the gamma distribution with $\alpha = \nu/2$ and $\beta = 2$. By the uniqueness
of mgfs, we have established the following distributional result.


PROPOSITION The chi-squared distribution with $\nu$ degrees of freedom is the gamma distribution
with $\alpha = \nu/2$ and $\beta = 2$. In particular, the pdf of the $\chi^2_\nu$ distribution is

$$f(x; \nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu/2)-1} e^{-x/2} \qquad x > 0$$

Moreover, if $X \sim \chi^2_\nu$ then $E(X) = \nu$, $V(X) = 2\nu$, and $M_X(t) = (1 - 2t)^{-\nu/2}$.


The mean and variance stated in the proposition follow from properties of the gamma distribution:

$$\mu = \alpha\beta = \frac{\nu}{2} \cdot 2 = \nu, \qquad \sigma^2 = \alpha\beta^2 = \frac{\nu}{2} \cdot 2^2 = 2\nu$$

Figure 6.15 shows graphs of the chi-squared pdf for 1, 2, 3, and 5 degrees of freedom. Notice that the
pdf is unbounded near x = 0 for 1 df and the pdf is exponentially decreasing for 2 df. Indeed, the
chi-squared for 2 df is exponential with mean 2, $f(x) = \tfrac{1}{2}e^{-x/2}$ for x > 0. If $\nu > 2$ the pdf is unimodal with
a peak at $x = \nu - 2$, as shown in Exercise 31. The distribution is skewed, but it becomes more
symmetric as the number of degrees of freedom increases, and for large df values the distribution is
approximately normal (see Exercise 29).
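The definition and the proposition can be tied together numerically. The following sketch (function names, seed, and the choice $\nu = 5$ are illustrative) builds chi-squared draws directly from squared standard normals, compares their sample mean and variance with $\nu$ and $2\nu$, and evaluates the pdf formula at a few points.

```python
import math
import random
import statistics

def chi2_pdf(x, nu):
    """Chi-squared pdf with nu df, i.e. the gamma pdf with alpha = nu/2, beta = 2."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

def chi2_draw(nu, rng):
    """One chi-squared draw from the definition: a sum of nu squared N(0, 1) values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(nu))

rng = random.Random(3)
nu = 5
draws = [chi2_draw(nu, rng) for _ in range(40000)]
print(f"sample mean     {statistics.fmean(draws):.2f} (theory {nu})")
print(f"sample variance {statistics.variance(draws):.2f} (theory {2 * nu})")
print(f"2-df pdf at x = 1: {chi2_pdf(1, 2):.4f} vs exponential {0.5 * math.exp(-0.5):.4f}")
```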


[Figure: graphs of $f(x; \nu)$ against x for $\nu$ = 1, 2, 3, and 5]

Figure 6.15 The chi-squared pdf for 1, 2, 3, and 5 df

Without software, it is difficult to integrate a chi-squared pdf, so Table A.5 in the appendix has
critical values for chi-squared distributions. For example, the second row of the table is for 2 df, and
under the heading .01 the value 9.210 indicates that $P(\chi^2_2 > 9.210) = .01$. We will use the notation
$\chi^2_{.01,2} = 9.210$. In general, $\chi^2_{\alpha,\nu} = c$ means that $P(\chi^2_\nu > c) = \alpha$. Instructions for chi-squared
computations using R appear at the end of this section.

Example 6.12 The article "Reliability analysis of LED-based electronic devices" (Proc. Engr. 2012:
260-269) uses chi-squared distributions to model the lifecycle, in thousands of hours, of certain LED
lamps. In one particular setting, the authors suggest a parameter value of $\nu = 8$ df. Let X represent this
$\chi^2_8$ rv. The mean and standard deviation of X are $E(X) = \nu = 8$ thousand hours and $SD(X) = \sqrt{2\nu} =
\sqrt{16} = 4$ thousand hours.

We can use the gamma cdf, as illustrated in Chapter 4, to determine probabilities concerning X,
because the $\chi^2_8$ distribution is the same as the gamma distribution with $\alpha = 8/2 = 4$ and $\beta = 2$. For
instance, the probability an LED lamp of this type has a lifecycle between 6 and 10 thousand hours is

$$P(6 \le X \le 10) = G(10/2; 4) - G(6/2; 4) = G(5; 4) - G(3; 4) = .735 - .353 = .382$$

Next, what values define the "middle 95%" of lifecycle values for these LED lamps? We desire the
.025 and .975 quantiles of the $\chi^2_8$ distribution; from Appendix Table A.5, they are

$$\chi^2_{.975,8} = 2.180 \quad \text{and} \quad \chi^2_{.025,8} = 17.534$$

That is, the middle 95% of lifecycle values ranges from 2.180 to 17.534 thousand hours. ■
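The numbers in Example 6.12 can be checked without tables. When the df is even, the gamma shape $\alpha = \nu/2$ is an integer and the cdf has a closed form involving a finite sum; the sketch below relies on that standard identity rather than on any routine from the text, and the function name is hypothetical.

```python
import math

def chi2_cdf_even(x, nu):
    """P(X <= x) for a chi-squared rv with an even number of df nu.

    Uses the closed form of the gamma cdf with integer shape alpha = nu/2 and scale 2:
    1 - exp(-x/2) * sum_{j < alpha} (x/2)^j / j!
    """
    if nu % 2 != 0:
        raise ValueError("this closed form requires an even number of df")
    half = x / 2
    partial = sum(half ** j / math.factorial(j) for j in range(nu // 2))
    return 1 - math.exp(-half) * partial

nu = 8
p_middle = chi2_cdf_even(10, nu) - chi2_cdf_even(6, nu)
print(f"P(6 <= X <= 10) = {p_middle:.3f}")
print(f"P(X <= 2.180)   = {chi2_cdf_even(2.180, nu):.3f}")
print(f"P(X <= 17.534)  = {chi2_cdf_even(17.534, nu):.3f}")
```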


Given the definition of the chi-squared distribution, the following properties should come as no 
surprise. Proofs of both statements rely on moment generating functions (Exercises 32 and 33). 


PROPOSITION 1. If $X_3 = X_1 + X_2$, $X_1$ and $X_2$ are independent, $X_1 \sim \chi^2_{\nu_1}$ and $X_2 \sim \chi^2_{\nu_2}$, then
$X_3 \sim \chi^2_{\nu_1 + \nu_2}$.
2. If $X_3 = X_1 + X_2$, $X_1$ and $X_2$ are independent, $X_1 \sim \chi^2_{\nu_1}$ and $X_3 \sim \chi^2_{\nu_3}$ with
$\nu_3 > \nu_1$, then $X_2 \sim \chi^2_{\nu_3 - \nu_1}$.


Statement 1 says that the chi-squared distribution is an additive distribution; we saw in Chapter 5 that
the normal and Poisson distributions are also additive. Statement 2 establishes a "subtractive"
property of chi-squared, which will be critical in the next section for establishing the sampling
distribution of the sample variance.

The t Distribution


DEFINITION Let Z be a standard normal rv and let Y be a $\chi^2_\nu$ rv independent of Z. Then the
t distribution with $\nu$ degrees of freedom (df) is defined to be the distribution of
the ratio

$$T = \frac{Z}{\sqrt{Y/\nu}}$$

We will sometimes abbreviate this distribution by $T \sim t_\nu$.


With some careful calculus, we can obtain the t pdf.


PROPOSITION The pdf of a random variable T having a t distribution with $\nu$ degrees of
freedom is

$$f(t) = \frac{1}{\sqrt{\pi\nu}}\, \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\, \frac{1}{(1 + t^2/\nu)^{(\nu+1)/2}} \qquad -\infty < t < \infty$$


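Proof A $t_\nu$ variable is defined in terms of a standard normal Z and a $\chi^2_\nu$ variable Y. They are
independent, so their joint pdf f(y, z) is the product of their individual pdfs. We first find the cdf of
T and then differentiate to obtain the pdf:

$$F(t) = P(T \le t) = P\left(\frac{Z}{\sqrt{Y/\nu}} \le t\right) = P\left(Z \le t\sqrt{\frac{Y}{\nu}}\right) = \int_0^\infty \int_{-\infty}^{t\sqrt{y/\nu}} f(y, z)\, dz\, dy$$

Differentiating with respect to t using the Fundamental Theorem of Calculus,

$$f(t) = \frac{d}{dt}F(t) = \int_0^\infty \frac{\partial}{\partial t} \int_{-\infty}^{t\sqrt{y/\nu}} f(y, z)\, dz\, dy = \int_0^\infty f\left(y,\, t\sqrt{\frac{y}{\nu}}\right) \sqrt{\frac{y}{\nu}}\, dy$$

Now substitute the joint pdf (that is, the product of the marginal pdfs of Y and Z) and integrate:

$$\begin{aligned} f(t) &= \int_0^\infty \frac{y^{\nu/2 - 1} e^{-y/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-[t\sqrt{y/\nu}]^2/2} \cdot \sqrt{\frac{y}{\nu}}\, dy \\ &= \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)\sqrt{2\pi\nu}} \int_0^\infty y^{(\nu+1)/2 - 1} e^{-[1/2 + t^2/(2\nu)]y}\, dy \end{aligned}$$

The integral can be evaluated using Expression (4.5) from the section on the gamma distribution:

$$\begin{aligned} f(t) &= \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)\sqrt{2\pi\nu}} \cdot \frac{\Gamma((\nu+1)/2)}{[1/2 + t^2/(2\nu)]^{(\nu+1)/2}} \\ &= \frac{\Gamma((\nu+1)/2)}{\sqrt{\pi\nu}\,\Gamma(\nu/2)} \cdot \frac{1}{(1 + t^2/\nu)^{(\nu+1)/2}} \qquad -\infty < t < \infty \end{aligned}$$ ■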

The pdf has a maximum at 0 and decreases symmetrically as |t| increases. As $\nu$ becomes large, the
t pdf approaches the standard normal pdf, as shown in Exercise 36. It makes sense that the t distribution
would be close to the standard normal for large $\nu$, because $T = Z/\sqrt{\chi^2_\nu/\nu}$, and $\chi^2_\nu/\nu$ converges
to 1 by the Law of Large Numbers, as shown in Exercise 30.

Figure 6.16 shows t density curves for $\nu$ = 1, 5, and 20 along with the standard normal (z) curve. Notice
how fat the tails are for 1 df, as compared to the standard normal. However, as the number of df increases,
the t pdf becomes more like the standard normal. For 20 df there is not much difference.


[Figure: t density curves f(t) for $\nu$ = 1, 5, and 20 together with the standard normal curve]

Figure 6.16 Comparison of t curves to the z curve


Integration of the t pdf is difficult without software, so values of upper-tail areas are given in
Table A.7. For example, the value in the column labeled 2 and the row labeled 3.0 is .048, meaning
that $P(T > 3.0) = .048$ when $T \sim t_2$. We write this as $t_{.048,2} = 3.0$. In general we write $t_{\alpha,\nu} = c$ if
$P(T > c) = \alpha$ when $T \sim t_\nu$. A tabulation of these t critical values (i.e., $t_{\alpha,\nu}$) for frequently used tail
areas $\alpha$ appears in Table A.6.

Using $\Gamma(1/2) = \sqrt{\pi}$, we obtain the pdf for the t distribution with 1 df as $f(t) = 1/[\pi(1 + t^2)]$,
which is also known as the Cauchy distribution. This distribution has such heavy tails that the mean
does not exist (Exercise 37).
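The behavior of the t pdf described above (Cauchy form at 1 df, approach to the z curve as the df grows) can be seen by evaluating the formula directly. This is a minimal sketch; the function name and the evaluation points are illustrative choices.

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    """pdf of the t distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(math.pi * nu) * math.gamma(nu / 2))
    return coef / (1 + t * t / nu) ** ((nu + 1) / 2)

z = NormalDist()
for nu in (1, 5, 20, 100):
    print(f"nu={nu:>3}  f(0)={t_pdf(0, nu):.4f}  f(3)={t_pdf(3, nu):.5f}")
print(f"normal  f(0)={z.pdf(0):.4f}  f(3)={z.pdf(3):.5f}")
```

At t = 3 the density is largest for 1 df and decreases toward the normal value as the df increases, which is the "fat tails" effect in numerical form.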

The mean and variance of a t variable can be obtained directly from the pdf, but it's instructive to
derive them through the definition in terms of independent standard normal and chi-squared variables,
$T = Z/\sqrt{Y/\nu}$. Recall from Section 5.2 that E(UV) = E(U)E(V) if U and V are independent and the
expectations of U and V both exist. Thus,

$$E(T) = E(Z)\,E\left(1/\sqrt{Y/\nu}\right) = E(Z)\,\nu^{1/2} E(Y^{-1/2})$$

Of course, E(Z) = 0, so E(T) = 0 if $E(Y^{-1/2})$ exists. Let's compute $E(Y^k)$ for any k if Y is chi-squared,
using Expression (4.5):

$$\begin{aligned} E(Y^k) &= \int_0^\infty y^k\, \frac{y^{(\nu/2)-1} e^{-y/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\, dy = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^\infty y^{(k+\nu/2)-1} e^{-y/2}\, dy \\ &= \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \cdot 2^{k+\nu/2}\,\Gamma(k + \nu/2) = \frac{2^k\,\Gamma(k + \nu/2)}{\Gamma(\nu/2)} \quad \text{for } k + \nu/2 > 0 \qquad (6.7) \end{aligned}$$

If $k + \nu/2 \le 0$, the integral does not converge and $E(Y^k)$ does not exist. When $k = -\tfrac{1}{2}$, we require
that $\nu > 1$ for the integral to converge. Thus, the mean of a t variable fails to exist if $\nu = 1$ and the
mean is indeed 0 otherwise.

For the variance of T we need $E(T^2) = E(Z^2) \cdot E[1/(Y/\nu)] = 1 \cdot \nu E(Y^{-1})$. Using k = −1 in Expression
(6.7), we obtain, with the help of the property $\Gamma(a + 1) = a\Gamma(a)$,

$$V(T) = E(T^2) = \nu \cdot \frac{2^{-1}\,\Gamma(-1 + \nu/2)}{\Gamma(\nu/2)} = \nu \cdot \frac{2^{-1}}{\nu/2 - 1} = \frac{\nu}{\nu - 2}$$

provided that $-1 + \nu/2 > 0$, or $\nu > 2$. For 1 or 2 df the variance of T does not exist. For $\nu > 2$, the
variance always exceeds 1, and for large df the variance is close to 1. This is appropriate because any
t curve spreads out more than the z curve, but for large df the t curve approaches the z curve.
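Expression (6.7) and the resulting variance $\nu/(\nu - 2)$ can be checked both algebraically and by simulation. The sketch below is illustrative only: `chi2_moment` and `t_draw` are hypothetical helper names, $\nu = 10$ and the seed are arbitrary, and the t draws are built from the definition $T = Z/\sqrt{Y/\nu}$.

```python
import math
import random
import statistics

def chi2_moment(k, nu):
    """E(Y^k) for Y ~ chi-squared(nu), from Expression (6.7); requires k + nu/2 > 0."""
    if k + nu / 2 <= 0:
        raise ValueError("E(Y^k) does not exist when k + nu/2 <= 0")
    return 2 ** k * math.gamma(k + nu / 2) / math.gamma(nu / 2)

def t_draw(nu, rng):
    """One t draw from the definition T = Z / sqrt(Y / nu)."""
    z = rng.gauss(0, 1)
    y = sum(rng.gauss(0, 1) ** 2 for _ in range(nu))
    return z / math.sqrt(y / nu)

nu = 10
var_formula = nu * chi2_moment(-1, nu)        # nu * E(1/Y) = nu / (nu - 2)
rng = random.Random(11)
draws = [t_draw(nu, rng) for _ in range(50000)]

print(f"V(T) from (6.7):   {var_formula:.3f}")
print(f"nu / (nu - 2):     {nu / (nu - 2):.3f}")
print(f"simulated mean     {statistics.fmean(draws):.3f}")
print(f"simulated variance {statistics.variance(draws):.3f}")
```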


The F Distribution 


DEFINITION Let $Y_1$ and $Y_2$ be independent chi-squared random variables with $\nu_1$ and $\nu_2$
degrees of freedom, respectively. The F distribution with $\nu_1$ numerator
degrees of freedom and $\nu_2$ denominator degrees of freedom is defined to be
the distribution of the ratio

$$F = \frac{Y_1/\nu_1}{Y_2/\nu_2} \qquad (6.8)$$

This distribution will sometimes be denoted $F_{\nu_1,\nu_2}$.


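The pdf of a random variable having an F distribution is

$$f(x; \nu_1, \nu_2) = \frac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \frac{x^{\nu_1/2-1}}{(1 + \nu_1 x/\nu_2)^{(\nu_1+\nu_2)/2}} \qquad x > 0$$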


Its derivation (Exercise 40) is similar to the derivation of the $t$ pdf. Figure 6.17 shows the F density
curves for several choices of $\nu_1$ and $\nu_2 = 10$.

Figure 6.17 F density curves


The mean of the F distribution can be obtained with the help of Equation (6.8): $E(F) = \nu_2/(\nu_2 - 2)$
if $\nu_2 > 2$, and it does not exist if $\nu_2 \le 2$ (Exercise 41).

What happens to F if the degrees of freedom are large? If $\nu_2$ is large, then the denominator of
Expression (6.8) will be close to 1 (see Exercise 30), and F will be approximately the
numerator chi-squared variable divided by its degrees of freedom. Similarly, if both $\nu_1$ and $\nu_2$ are large, then both the
numerator and denominator will be close to 1, and the F ratio therefore will be close to 1.

Except for a few special choices of degrees of freedom, integration of the F pdf is difficult without
software, so F critical values (values that capture specified F distribution tail areas) are given in
Table A.8. For example, the value in the column labeled 1 and the row labeled 2 and .100 is 8.53,
meaning that $P(F_{1,2} > 8.53) = .100$. We can express this as $F_{.1,1,2} = 8.53$, where $F_{\alpha,\nu_1,\nu_2} = c$ means
that $P(F_{\nu_1,\nu_2} > c) = \alpha$.

That same table can also be used to determine some lower-tail areas. Since $1/F = (Y_2/\nu_2)/(Y_1/\nu_1)$,
the reciprocal of an F variable also has an F distribution, but with the degrees of freedom reversed,
and this can be used to obtain lower-tail critical values. For example, $.100 = P(F_{1,2} > 8.53) =
P(1/F_{1,2} < 1/8.53) = P(F_{2,1} < .117)$. This can be written as $F_{.9,2,1} = .117$ because $.9 = P(F_{2,1} > .117)$.
In general,

$$F_{p,\nu_1,\nu_2} = \frac{1}{F_{1-p,\nu_2,\nu_1}}$$


Finally, recalling the definition $T = Z/\sqrt{Y/\nu}$ of a $t_\nu$ rv, it follows that

$$T^2 = \frac{Z^2}{Y/\nu} = \frac{\chi^2_1/1}{\chi^2_\nu/\nu} \sim F_{1,\nu}$$

That is, $t_\nu^2 = F_{1,\nu}$. In theory, we can use this to obtain tail areas. For example,

$$.100 = P(F_{1,2} > 8.53) = P(T_2^2 > 8.53) = P(|T_2| > \sqrt{8.53}) = 2P(T_2 > 2.92),$$

and therefore $.05 = P(T_2 > 2.92)$. We previously determined that $.048 = P(T_2 > 3.0)$, which is very
nearly the same statement. In terms of our notation, $t_{.05,2} = \sqrt{F_{.10,1,2}}$, and we can similarly show that
in general $t_{\alpha,\nu} = \sqrt{F_{2\alpha,1,\nu}}$ if $0 < \alpha < .5$.
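As a rough numerical check of the reciprocal and squared-$t$ relationships, the sketch below integrates the F pdf with Simpson's rule and compares the result with a $t_2$ tail area. It is an illustration only: the helper names are hypothetical, and the closed-form $t_2$ cdf, $1/2 + t/(2\sqrt{2+t^2})$, is a standard result assumed here rather than taken from this section.

```python
import math

def f_pdf(x, v1, v2):
    """pdf of the F distribution with v1 numerator df and v2 denominator df."""
    log_const = (math.lgamma((v1 + v2) / 2) - math.lgamma(v1 / 2)
                 - math.lgamma(v2 / 2) + (v1 / 2) * math.log(v1 / v2))
    return math.exp(log_const) * x ** (v1 / 2 - 1) / (1 + v1 * x / v2) ** ((v1 + v2) / 2)

def simpson(g, a, b, n=2000):
    """Composite Simpson's rule on [a, b] using an even number n of subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def t2_upper_tail(t):
    """P(T > t) for T ~ t with 2 df, from the closed-form cdf (assumed, see lead-in)."""
    return 0.5 - t / (2 * math.sqrt(2 + t * t))

# Lower-tail area P(F_{2,1} < 1/8.53); by the reciprocal property this equals P(F_{1,2} > 8.53).
lower_tail = simpson(lambda x: f_pdf(x, 2, 1), 0.0, 1 / 8.53)

# Two-sided t_2 area P(|T_2| > sqrt(8.53)), which should agree with the same F tail area.
two_sided = 2 * t2_upper_tail(math.sqrt(8.53))

print(lower_tail, two_sided)
```

Both printed values should be close to the tabled tail area .100. The $F_{2,1}$ pdf is used for the integration because it is bounded at 0, whereas the $F_{1,2}$ pdf is not.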
Chi-squared, t, and F Calculations with Software

Although we have made several references in this section to the statistical tables in the appendix,
software alleviates the need to rely on such tables. The commands for the cdf and quantile functions
of the $\chi^2$, $t$, and F distributions in R are presented in Table 6.3.


Table 6.3 R code for chi-squared, t, and F calculations

            Chi-squared      t            F
cdf         pchisq(x, ν)     pt(x, ν)     pf(x, ν1, ν2)
Quantile    qchisq(p, ν)     qt(p, ν)     qf(p, ν1, ν2)


Critical values can be computed by substituting $p = 1 - \alpha$ into the quantile functions. For instance,
the $\alpha = .1$ critical value of the $F_{1,2}$ distribution, $F_{.1,1,2} = 8.53$, can be obtained with qf(.9, 1, 2)
in R.


Exercises: Section 6.3 (28-50)

28. a. Use Table A.5 to find $\chi^2_{[\ldots]}$.
    b. Verify the answer to (a) by integrating the pdf.
    c. Verify the answer to (a) by using software.

29. Why should $\chi^2_\nu$ be approximately normal for large $\nu$? What theorem applies here, and why?

30. Apply the Law of Large Numbers to show that $\chi^2_\nu/\nu$ approaches 1 as $\nu$ becomes large.

31. Show that the $\chi^2_\nu$ density function has a maximum at $\nu - 2$ if $\nu > 2$.

32. Show that if $X_1$ and $X_2$ are independent, $X_1 \sim \chi^2_{\nu_1}$, and $X_2 \sim \chi^2_{\nu_2}$, then
    $X_1 + X_2 \sim \chi^2_{\nu_1+\nu_2}$. [Hint: Use mgfs.]

33. a. Show that if $X_1$ and $X_2$ are independent, $X_1 \sim \chi^2_{\nu_1}$, and $X_1 + X_2 \sim \chi^2_{\nu_3}$ with
       $\nu_3 > \nu_1$, then $X_2 \sim \chi^2_{\nu_3-\nu_1}$. [Hint: Use mgfs.]
    b. In the setting of part (a), can we allow $\nu_3 < \nu_1$? The answer is no: show that if
       $X_1$ and $X_2$ are independent, $X_1 \sim \chi^2_{\nu_1}$, and $X_1 + X_2 \sim \chi^2_{\nu_3}$, then $\nu_3 \ge \nu_1$.
       [Hint: Calculate the variance of $X_1 + X_2$.]

34. a. Use Table A.6 to find $t_{.102,1}$.
    b. Verify the answer to part (a) by integrating the pdf.
    c. Verify the answer to part (a) using software.

35. a. Use Table A.6 to find $t_{.005,10}$.
    b. Use Table A.8 to find $F_{.01,1,10}$ and relate this to the value you obtained in part (a).
    c. Verify the answer to part (b) using software.

36. Show that the $t$ pdf approaches the standard normal pdf for large df values.
    [Hint: $\Gamma(x + 1/2)/[\sqrt{x}\,\Gamma(x)] \to 1$ and $(1 + a/x)^x \to e^a$ as $x \to \infty$.]

37. Show directly from the pdf that the mean of a $t_1$ (Cauchy) random variable does not exist.

38. Show that the ratio of two independent standard normal random variables has the $t_1$ distribution.
    [Hint: Split the domain of the denominator into positive and negative parts.]

39. a. Use Table A.8 to find $F_{.1,2,4}$.
    b. Verify the answer to part (a) using the pdf.
    c. Verify the answer to part (a) using software.

40. Derive the F pdf by applying the method used to derive the $t$ pdf.

41. Let $X$ have an F distribution with $\nu_1$ numerator df and $\nu_2$ denominator df.
    a. Determine the mean value of $X$.
    b. Determine the variance of $X$.

42. Is $E(F_{\nu_1,\nu_2}) = E(\chi^2_{\nu_1}/\nu_1)/E(\chi^2_{\nu_2}/\nu_2)$? Explain.

43. Show that $F_{p,\nu_1,\nu_2} = 1/F_{1-p,\nu_2,\nu_1}$.

44. a. Use Table A.6 to find $t_{.25,10}$.
    b. Use (a) to find the median of the $F_{1,10}$ distribution.
    c. Verify the answer to part (b) using software.

45. Show that if $X$ has a gamma distribution and $c > 0$ is a constant, then $cX$ has a
    gamma distribution. In particular, if $X$ is chi-squared distributed, then $cX$ has a
    gamma distribution.

46. Suppose $T \sim t_9$. Determine the distribution of $1/T^2$.

47. Let $Z_1, Z_2, X_1, X_2, X_3$ be independent rvs with each $Z_i \sim N(0, 1)$ and each $X_i \sim N(0, 5)$.
    Construct a variable involving the $Z_i$'s and $X_i$'s which has an $F_{3,2}$ distribution.

48. Let $Z_1, Z_2, \ldots, Z_{10}$ be independent standard normal. Use these to construct
    a. A $\chi^2_4$ random variable.
    b. A $t_4$ random variable.
    c. An $F_{4,6}$ random variable.
    d. A Cauchy random variable.
    e. An exponential random variable with mean 2.
    f. An exponential random variable with mean 1.
    g. A gamma random variable with mean 1 and variance .5. [Hint: Use part (a) and Exercise 45.]

49. a. Use Exercise 29 to approximate $P(\chi^2_{50} > 70)$, and compare the result with the answer
       given by software, .03237.
    b. Use the formula from Table A.5, $\chi^2_{\alpha,\nu} \approx \nu\left(1 - 2/(9\nu) + z_\alpha\sqrt{2/(9\nu)}\right)^3$, to
       approximate $P(\chi^2_{50} > 70)$, and compare with part (a).

50. The difference of two independent normal variables itself has a normal distribution. Is
    it true that the difference between two independent chi-squared variables has a
    chi-squared distribution? Explain.

6.4 Distributions Based on Normal Random Samples


Let $X_1, \ldots, X_n$ be a random sample from a normally distributed population. We saw previously that
the sampling distribution of the sample mean, $\bar{X}$, is then also normal. In this section, we develop the
sampling distribution of the sample variance $S^2$, the joint distribution of $\bar{X}$ and $S^2$, and the
distributions of other important statistics when sampling from a normal distribution. The $\chi^2$, $t$, and
F distributions of Section 6.3 will feature centrally in this section, and the results established here will
serve as the backbone for many of the statistical inference procedures in the second half of this book.


The Joint Sampling Distribution of $\bar{X}$ and $S^2$


For a random sample $X_1, \ldots, X_n$, the sample variance $S^2$ is defined as a rv by

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

This can be used to calculate an estimate of $\sigma^2$ when the population mean $\mu$ is unknown. This is the
same formula presented in Section 1.4, but now we acknowledge that $S^2$, like any statistic, will vary
in value from sample to sample. To establish the sampling distribution of $S^2$ when sampling from a
normal population, we first need the following critical result.


THEOREM If $X_1, X_2, \ldots, X_n$ form a random sample from a normal distribution, then $\bar{X}$ and $S^2$
are independent.

Proof Consider the covariances between the sample mean and the deviations from the sample mean.
Using the linearity of the covariance operator,

$$\mathrm{Cov}(X_i - \bar{X}, \bar{X}) = \mathrm{Cov}(X_i, \bar{X}) - \mathrm{Cov}(\bar{X}, \bar{X})$$

$$= \mathrm{Cov}\!\left(X_i, \frac{1}{n}\sum_{k=1}^{n} X_k\right) - V(\bar{X})$$

$$= \frac{1}{n}\mathrm{Cov}(X_i, X_i) + \frac{1}{n}\sum_{k \ne i}\mathrm{Cov}(X_i, X_k) - V(\bar{X})$$

$$= \frac{1}{n}V(X_i) + 0 - V(\bar{X}) = \frac{\sigma^2}{n} - \frac{\sigma^2}{n} = 0$$

The middle term in the third line is zero because independence of the $X_i$'s implies that $\mathrm{Cov}(X_i, X_k) = 0$
for $k \ne i$. This shows that $\bar{X}$ is uncorrelated with all the deviations of the observations from their
mean. In general, this would not imply independence, but in the special case of the bivariate normal
distribution, being uncorrelated is equivalent to independence. Both $\bar{X}$ and $X_i - \bar{X}$ are linear
combinations of the independent normal observations, so their joint distribution is bivariate normal, as
discussed in Section 5.5. Because the sample variance $S^2$ is composed of the deviations $X_i - \bar{X}$, we
conclude that $\bar{X}$ and $S^2$ are independent. ■


To better understand the foregoing independence property, consider selecting sample after sample
of size $n$ from a particular population distribution, calculating $\bar{x}$ and $s$ for each sample, and then
plotting the resulting $(\bar{x}, s)$ pairs. Figure 6.18a shows the result for 1000 samples of size $n = 5$ from a
standard normal population distribution. The elliptical pattern, with axes parallel to the coordinate
axes, suggests no relationship between $\bar{x}$ and $s$, that is, independence of the statistics $\bar{X}$ and
$S$ (equivalently, $\bar{X}$ and $S^2$). However, this independence fails for data from a nonnormal distribution.
Figure 6.18b illustrates what happens for samples of size 5 from an exponential distribution with
mean 1. This plot shows a strong relationship between the two statistics, which is what might be
expected for data from a highly skewed distribution.


Figure 6.18 Plot of $(\bar{x}, s)$ pairs for (a) samples from a normal distribution; (b) samples from a nonnormal distribution
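The experiment behind Figure 6.18 can be imitated with a short simulation. The sketch below is an illustration rather than the code used to produce the figure: it assumes 2000 samples of size 5, arbitrary seeds, and hypothetical helper names, and it summarizes each scatterplot by its sample correlation. A correlation near 0 is consistent with independence but does not by itself establish it.

```python
import math
import random

def mean_sd(sample):
    """Return the sample mean and sample standard deviation (divisor n - 1)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return xbar, s

def pearson(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def xbar_s_correlation(draw, n=5, reps=2000, seed=1):
    """Correlation between xbar and s over reps samples of size n drawn by draw(rng)."""
    rng = random.Random(seed)
    pairs = [mean_sd([draw(rng) for _ in range(n)]) for _ in range(reps)]
    xbars, sds = zip(*pairs)
    return pearson(xbars, sds)

# Standard normal population versus exponential population with mean 1.
r_normal = xbar_s_correlation(lambda rng: rng.gauss(0.0, 1.0))
r_expon = xbar_s_correlation(lambda rng: rng.expovariate(1.0))
print(r_normal, r_expon)
```

For the normal samples the correlation should be close to 0, while for the exponential samples it should be clearly positive, in line with the two panels of the figure.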
We are now ready to derive the sampling distribution of $S^2$ when sampling from a normally
distributed population. Notice that we'll then know the joint distribution of $\bar{X}$ and $S^2$, since it was
established in Section 6.2 that $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$ and we just proved that these two statistics are
independent.


PROPOSITION If $X_1, X_2, \ldots, X_n$ form a random sample from a $N(\mu, \sigma)$ distribution, then
$(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.


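Proof To begin, write

$$\sum (X_i - \mu)^2 = \sum \left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2$$

$$= \sum (X_i - \bar{X})^2 + 2(\bar{X} - \mu)\sum (X_i - \bar{X}) + \sum (\bar{X} - \mu)^2$$

The middle term on the second line vanishes (do you see why?). Dividing through by $\sigma^2$, we obtain

$$\sum \left(\frac{X_i - \mu}{\sigma}\right)^2 = \sum \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + \sum \left(\frac{\bar{X} - \mu}{\sigma}\right)^2 = \sum \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + n\left(\frac{\bar{X} - \mu}{\sigma}\right)^2$$

This can be re-written as

$$\sum \left(\frac{X_i - \mu}{\sigma}\right)^2 = \frac{(n-1)S^2}{\sigma^2} + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2 \qquad (6.9)$$

If $X_i \sim N(\mu, \sigma)$, then $(X_i - \mu)/\sigma$ is a standard normal rv. So, the left-hand side of (6.9) is the sum of
squares of $n$ iid standard normal rvs, which by definition has a $\chi^2_n$ distribution. At the same time, the
rightmost term in (6.9) is the square of the standardized version of $\bar{X}$. So, it's distributed as $Z^2$ with
$Z \sim N(0, 1)$, which by definition is $\chi^2_1$. And, critically, the two terms on the right-hand side of (6.9)
are independent, because $S^2$ and $\bar{X}$ are independent. Therefore, from the "subtractive" property of the
chi-squared distribution in Section 6.3 (with $\nu_3 = n$ and $\nu_1 = 1$), we conclude that $(n-1)S^2/\sigma^2
\sim \chi^2_{n-1}$, as claimed. ■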


Intuitively, the degrees of freedom make sense because $s^2$ is built from the deviations
$(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, which sum to zero:

$$\sum (x_i - \bar{x}) = \sum x_i - \sum \bar{x} = n\bar{x} - n\bar{x} = 0$$

The last deviation is determined by the first $(n-1)$ deviations, so it is reasonable that $s^2$ has only $(n-1)$
degrees of freedom. The degrees of freedom help to explain why the definition of $s^2$ has $(n-1)$ and not
$n$ in the denominator.

Knowing that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, it can be shown (see Exercise 52) that the expected value of $S^2$
is $\sigma^2$, and also that the variance of $S^2$ approaches 0 as $n$ becomes large.
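The proposition can be checked empirically. The sketch below simulates $(n-1)S^2/\sigma^2$ for normal samples and compares its mean and variance with those of a $\chi^2_{n-1}$ distribution, namely $n-1$ and $2(n-1)$. The choices $n = 5$, $\mu = 10$, $\sigma = 2$, the seed, and the function name are arbitrary assumptions made for illustration.

```python
import random
import statistics

def scaled_sample_variances(n, mu, sigma, reps, seed=2):
    """Simulate (n - 1) * S^2 / sigma^2 for reps normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance with divisor n - 1
        values.append((n - 1) * s2 / sigma ** 2)
    return values

q = scaled_sample_variances(n=5, mu=10.0, sigma=2.0, reps=20000)
q_mean = statistics.fmean(q)      # should be near n - 1 = 4
q_var = statistics.variance(q)    # should be near 2(n - 1) = 8
print(q_mean, q_var)
```

Because the simulated mean of $(n-1)S^2/\sigma^2$ is close to $n-1$, the simulated mean of $S^2$ itself is close to $\sigma^2$, which is the content of Exercise 52(a).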
A t-Distributed Statistic

In Section 6.3, we defined the $t$ distribution as a particular ratio involving a normal rv and a
chi-squared rv. From the definition it is not obvious how the $t$ distribution can be applied to data, but the
next result puts the distribution in more directly usable form. This result was originally discovered in
1908 by William Sealy Gosset, a statistician at the Guinness Brewery in Dublin, Ireland.


GOSSET'S THEOREM If $X_1, X_2, \ldots, X_n$ is a random sample from a $N(\mu, \sigma)$ distribution, then
the rv

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has the $t$ distribution with $(n-1)$ degrees of freedom, $t_{n-1}$.


Proof Re-express the fraction in a slightly messier way:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\left[\dfrac{(n-1)S^2}{\sigma^2}\right] \Big/ (n-1)}}$$

The numerator on the right-hand side is standard normal. The denominator is the square root of a $\chi^2_{n-1}$
variable, divided by its degrees of freedom. This chi-squared variable is independent of the numerator,
so by definition the ratio has the $t$ distribution with $n - 1$ degrees of freedom. ■


It's worth comparing the two rvs

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \qquad (6.10)$$

When $X_1, \ldots, X_n$ are iid normal rvs, then $Z$ has a standard normal distribution. By contrast, the rv $T$,
obtained by replacing $\sigma$ with $S$ in the expression for $Z$ in (6.10), has a $t_{n-1}$ distribution. Replacing the
constant $\sigma$ with the rv $S$ results in $T$ having greater variability than $Z$, which is consistent with the
comparison between the $t$ distributions and the standard normal distribution described in Section 6.3
(look back at Figure 6.16).
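The extra variability of $T$ relative to $Z$ can be seen in a simulation. The sketch below, written under illustrative assumptions (standard normal data, $n = 5$, 20,000 replications, a hypothetical function name), estimates $P(-2 < Z < 2)$ and $P(-2 < T < 2)$ from the same samples.

```python
import math
import random
import statistics

def coverage_within_2(n, reps=20000, mu=0.0, sigma=1.0, seed=3):
    """Estimate P(|Z| < 2) and P(|T| < 2) for the rvs in (6.10) from normal samples."""
    rng = random.Random(seed)
    z_hits = 0
    t_hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)
        z = (xbar - mu) / (sigma / math.sqrt(n))  # standardized with the known sigma
        t = (xbar - mu) / (s / math.sqrt(n))      # standardized with the estimate S
        z_hits += abs(z) < 2
        t_hits += abs(t) < 2
    return z_hits / reps, t_hits / reps

z_prop, t_prop = coverage_within_2(n=5)
print(z_prop, t_prop)
```

The proportion for $Z$ should be near .95 regardless of $n$, while the proportion for $T$ is noticeably smaller for $n = 5$ and moves toward the $Z$ value as $n$ grows (compare Exercise 57).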


An F-Distributed Statistic

Suppose that we have a random sample of $m$ observations from the normal population $N(\mu_1, \sigma_1)$ and an
independent random sample of $n$ observations from a second normal population $N(\mu_2, \sigma_2)$. Then for the
sample variance $S_1^2$ from the first group we know $(m-1)S_1^2/\sigma_1^2$ is $\chi^2_{m-1}$, and similarly for the second
group $(n-1)S_2^2/\sigma_2^2$ is $\chi^2_{n-1}$. Thus, according to the definition of the F distribution given in (6.8),

$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\left[(m-1)S_1^2/\sigma_1^2\right]/(m-1)}{\left[(n-1)S_2^2/\sigma_2^2\right]/(n-1)} \sim F_{m-1,n-1} \qquad (6.11)$$

The F distribution, via Expression (6.11), will be used in Chapter 10 to compare the variances from
two independent groups. Also, for several independent groups, in Chapter 11 we will use the F
distribution to see if the differences among sample means are bigger than would be expected by chance.
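Expression (6.11) can be examined by simulation as well. The sketch below forms the scaled variance ratio for two independent normal samples and compares its average with the F mean $\nu_2/(\nu_2 - 2)$ stated in Section 6.3. The sample sizes $m = 10$ and $n = 15$, the standard deviations, the seed, and the function name are assumptions chosen only for illustration.

```python
import random
import statistics

def variance_ratios(m, n, sigma1, sigma2, reps, seed=4):
    """Simulate (S1^2 / sigma1^2) / (S2^2 / sigma2^2) for independent normal samples."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        s1_sq = statistics.variance([rng.gauss(0.0, sigma1) for _ in range(m)])
        s2_sq = statistics.variance([rng.gauss(0.0, sigma2) for _ in range(n)])
        ratios.append((s1_sq / sigma1 ** 2) / (s2_sq / sigma2 ** 2))
    return ratios

ratios = variance_ratios(m=10, n=15, sigma1=3.0, sigma2=0.5, reps=20000)
sim_mean = statistics.fmean(ratios)

# With m - 1 = 9 and n - 1 = 14 df, the F mean is nu2 / (nu2 - 2) = 14 / 12.
theory_mean = 14 / (14 - 2)
print(sim_mean, theory_mean)
```

Note that the two population standard deviations differ greatly here, yet the scaled ratio still behaves like an $F_{9,14}$ variable, because each sample variance is divided by its own population variance.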


Exercises: Section 6.4 (51-58)

51. Show that when sampling from a normal distribution, the sample variance $S^2$ has a
    gamma distribution, and identify the parameters $\alpha$ and $\beta$. [Hint: See Exercise 45.]

52. Knowing that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ for a normal random sample,
    a. Show that $E(S^2) = \sigma^2$.
    b. Show that $V(S^2) = 2\sigma^4/(n-1)$. What happens to this variance as $n$ gets large?
    c. Apply Expression (6.7) to show that

       $$E(S) = \sigma\,\frac{\sqrt{2}\,\Gamma(n/2)}{\sqrt{n-1}\,\Gamma[(n-1)/2]}$$

       Then show that $E(S) = \sigma\sqrt{2/\pi}$ if $n = 2$. Is $E(S) = \sigma$ for normal data?

53. Suppose $X_1, \ldots, X_{13}$ form a random sample from a normal distribution with mean 5 and
    standard deviation 8.
    a. Calculate $P(\bar{X} < 9.13)$.
    b. Calculate $P\left(\sum (X_i - \bar{X})^2 < 1187\right)$. [Hint: How does this relate to $S^2$?]
    c. Calculate $P\left(\bar{X} < 9.13 \cap \sum (X_i - \bar{X})^2 < 1187\right)$.
    d. In this context, construct a rv that has a $t$ distribution, and identify its df.

54. In the unusual situation that the population mean $\mu$ is known but the population variance
    $\sigma^2$ is not, we might consider the following alternative statistic for estimating $\sigma^2$:

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

    a. Show that $E(\hat{\sigma}^2) = \sigma^2$ regardless of whether the $X_i$'s are normally distributed
       (but still assuming they comprise a random sample from some population distribution).
    b. Now assume the $X_i$'s are normally distributed. Determine a scaling constant $c$ so that
       the rv $c \cdot \hat{\sigma}^2$ has a chi-squared distribution, and identify the number of
       degrees of freedom.
    c. Determine the variance of $\hat{\sigma}^2$ assuming the $X_i$'s are normally distributed.

55. Suppose $X_1, \ldots, X_{27}$ are iid $N(5, 4)$ rvs. Let $\bar{X}$ and $S$ denote their sample mean and
    sample standard deviation, respectively. Calculate $P(|\bar{X} - 5| > 0.4S)$.

56. It was established in this section that $\bar{X}$ and $S^2$ are independent rvs when sampling from
    a normal population. Is the same true for $\bar{X}$ and $\hat{\sigma}^2 = (1/n)\sum (X_i - \mu)^2$, the estimator
    from Exercise 54? Let's find out.
    a. Let $X \sim N(\mu, \sigma)$. Determine $\mathrm{Cov}(X - \mu, (X - \mu)^2)$ and $\mathrm{Cov}(X, (X - \mu)^2)$.
       [Hint: Use the covariance shortcut formula.]
    b. Use part (a) to show that $\bar{X}$ and $\hat{\sigma}^2$ are uncorrelated. Does it follow that $\bar{X}$ and
       $\hat{\sigma}^2$ are independent?
    c. The proof of the independence of $\bar{X}$ and $S^2$ relied critically on the fact
       that $\mathrm{Cov}(\bar{X}, X_i - \bar{X}) = 0$. Calculate $\mathrm{Cov}(\bar{X}, X_i - \mu)$. Based on this result,
       does it appear that $\bar{X}$ and $\hat{\sigma}^2$ are independent?

57. Suppose we have a sample of size $n$ from a $N(\mu, \sigma)$ distribution. Define rvs $Z$ and $T$ as
    in Expression (6.10).
    a. Calculate $P(-2 < Z < 2)$ for $n$ = 5, 10, and 15. How does sample size affect your answer?
    b. Calculate $P(-2 < T < 2)$ for $n$ = 5, 10, and 15. How does sample size affect your answer?

58. Suppose that we have a random sample of size $m$ from a $N(\mu_1, \sigma_1)$ distribution and an
    independent random sample of size $n$ from a $N(\mu_2, \sigma_2)$ distribution. To assess whether
    the two populations have equal variances (a requirement of several procedures later in
    this book), we consider the ratio of the two sample variances:

    $$R = \frac{S_1^2}{S_2^2}$$

    a. If the two populations indeed have equal variances, that is, if $\sigma_1^2 = \sigma_2^2$,
       then what is the distribution of $R$?
    b. A common convention for accepting that the population variances might be equal is that
       the larger sample standard deviation should be no more than twice the smaller. Express
       that condition in terms of $R$.
    c. For the specific case $m = 10$ and $n = 15$, calculate the probability of the condition
       in part (b), assuming that the two population variances are indeed equal.
    d. If the population variances really are equal, but the sample sizes are now
       $m = 50$ and $n = 60$, will the probability in part (c) be higher or lower? Why?


Supplementary Exercises: (59-68)

59. A small high school holds its graduation ceremony in the gym. Because of seating
    constraints, students are limited to a maximum of four tickets to graduation for
    family and friends. The vice principal knows that historically 30% of students
    want four tickets, 25% want three, 25% want two, 15% want one, and 5% want none.
    a. Let $X$ = the number of tickets requested by a randomly selected graduating student,
       and assume the historical distribution applies to this rv. Find the mean
       and standard deviation of $X$.
    b. Let $T_o$ = the total number of tickets requested by the 150 students graduating
       this year. Assuming all 150 students' requests are independent, determine the
       mean and standard deviation of $T_o$.
    c. The gym can seat a maximum of 500 guests. Calculate the (approximate)
       probability that all students' requests can be accommodated. [Hint: Express
       this probability in terms of $T_o$. What distribution does $T_o$ have?]

60. Suppose that for a certain individual, calorie intake at breakfast is a random variable
    with expected value 500 and standard deviation 50, calorie intake at lunch is
    random with expected value 900 and standard deviation 100, and calorie intake at
    dinner is a random variable with expected value 2000 and standard deviation 180.
    Assuming that intakes at different meals are independent of each other, what is the
    probability that average calorie intake per day over the next (365-day) year is at most
    3500? [Hint: Let $X_i$, $Y_i$, and $Z_i$ denote the three calorie intakes on day $i$. Then total
    intake is given by $\sum (X_i + Y_i + Z_i)$.]

61. Suppose the proportion of rural voters in a certain state who favor a particular
    gubernatorial candidate is .45 and the proportion of suburban and urban voters favoring the
    candidate is .60. If a sample of 200 rural voters and 300 urban and suburban voters
    is obtained, what is the approximate probability that at least 250 of these voters favor
    this candidate?

62. Let $\mu$ denote the true pH of a chemical compound. A sequence of $n$ independent
    sample pH determinations will be made. Suppose each sample pH is a random
    variable with expected value $\mu$ and standard deviation .1. How many determinations are
    required if we wish the probability that the sample average is within .02 of the true pH
    to be at least .95? What theorem justifies your probability calculation?

63. A large university has 500 single employees who are covered by its dental plan.
    Suppose the number of claims filed during the next year by such an employee is a
    Poisson rv with mean value 2.3. Assuming that the number of claims filed by any such
    employee is independent of the number filed by any other employee, what is the
    approximate probability that the total number of claims filed is at least 1200?

64. Consider independent and identically distributed random variables $X_1, X_2, X_3, \ldots$
    where each $X_i$ has a discrete uniform distribution on the integers 0, 1, 2, ..., 9; that
    is, $P(X_i = k) = 1/10$ for $k$ = 0, 1, 2, ..., 9. Now form the sum

    $$U_n = \sum_{i=1}^{n} \frac{1}{(10)^i} X_i = .1X_1 + .01X_2 + \cdots + (.1)^n X_n$$

    Intuitively, this is just the first $n$ digits in the decimal expansion of a random number on
    the interval [0, 1]. Show that as $n \to \infty$, $U_n$ converges in distribution to an rv $U$
    uniformly distributed on [0, 1], i.e., that $P(U_n \le u) \to P(U \le u)$, by showing that
    the moment generating function of $U_n$ converges to the moment generating function of $U$.
    [The argument for this appears on p. 52 of the article "A Few Counter Examples
    Useful in Teaching Central Limit Theorems," The American Statistician, Feb. 2013.]

65. The Empirical Rule from Chapter 4 states that roughly 68% of a standard normal
    distribution is within ±1 of its center, 95% within ±2, and 99.7% within ±3.
    a. For the $t_1$ distribution, determine what percent of the total area is within ±1,
       ±2, and ±3 of its center.
    b. For the $t_1$ distribution, determine how far you must go out to capture 68, 95, and
       99.7% of the total area under the pdf.

66. a. Show that if $X \sim F_{\nu_1,\nu_2}$, then the distribution of $\nu_1 X$ approaches $\chi^2_{\nu_1}$ as
       $\nu_2 \to \infty$. [Hint: Apply Exercise 30.] What is the limiting distribution of $X$
       itself as $\nu_2 \to \infty$?
    b. Show that if $X \sim F_{\nu_1,\nu_2}$, then the distribution of $\nu_2/X$ approaches $\chi^2_{\nu_2}$ as
       $\nu_1 \to \infty$. What is the limiting distribution of $X$ itself as $\nu_1 \to \infty$?

67. Suppose that we have a random sample of 10 observations from a $N(\mu_1, \sigma_1)$ distribution
    and an independent random sample of 12 observations from a $N(\mu_2, \sigma_2)$ distribution.
    Let $S_1^2$ and $S_2^2$ denote the sample variances of these two random samples.


oj 
<8.12+ 
05 


b. Define a rv Gi= a en i a 


from the first random hea and define 
as similarly for the second random 


sample. Determine 


a. Determine 


o S2 
2.90 as =1 
(2907 <5 


68. Let $X_1, X_2, \ldots$ be a sequence of independent, but not necessarily identically
    distributed random variables, and let $T_n = X_1 + \cdots + X_n$. Lyapunov's Theorem states
    that the standardized rv $(T_n - \mu_{T_n})/\sigma_{T_n}$ converges to a $N(0, 1)$ distribution as
    $n \to \infty$, provided that

    $$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} E\left(|X_i - \mu_i|^3\right)}{\sigma_{T_n}^3} = 0$$

    where $\mu_i = E(X_i)$. This limit is sometimes referred to as the Lyapunov condition for
    convergence.
    a. Assuming $E(X_i) = \mu_i$ and $V(X_i) = \sigma_i^2$, write expressions for $\mu_{T_n}$ and $\sigma_{T_n}$.
    b. Show that the Lyapunov condition is automatically met when the $X_i$'s are iid.
       [Hint: Let $\tau = E(|X_i - \mu_i|^3)$, which we assume is finite, and observe that $\tau$ is the
       same for every $X_i$. Then simplify the limit.]
    c. Let $X_1, X_2, \ldots$ be independent random variables, with $X_i$ having an exponential
       distribution with mean $i$. Show that $X_1 + \cdots + X_n$ has an approximately
       normal distribution as $n$ increases.
    d. An online trivia game presents progressively harder questions to players;
       specifically, the probability of answering the $i$th question correctly is $1/i$. Assume
       any player's successive answers are independent, and let $T_n$ denote the
       number of questions a player has right out of the first $n$. Show that $T_n$ has an
       approximately normal distribution for large $n$.


Appendix: Proof of the Central Limit Theorem 


First, here is a restatement of the theorem. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution
with mean $\mu$ and standard deviation $\sigma$. Then, if $Z$ is a standard normal random variable,

$$\lim_{n \to \infty} P\!\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z\right) = P(Z < z) = \Phi(z)$$

The theorem says that the distribution of the standardized $\bar{X}$ approaches the standard normal
distribution. Our proof is for the special case in which the moment generating function exists, which
implies also that all its derivatives exist and that they are continuous. We will show that the mgf of the
standardized $\bar{X}$ approaches the mgf of the standard normal distribution. Convergence of the mgf
implies convergence of the distribution, though we will not prove that here (the mathematics is
beyond the scope of this book).

To simplify the proof slightly, define new rvs by $W_i = (X_i - \mu)/\sigma$ for $i = 1, 2, \ldots, n$, the
standardized versions of the $X_i$. Then $X_i = \mu + \sigma W_i$, from which $\bar{X} = \mu + \sigma\bar{W}$ and we may write the
standardized $\bar{X}$ expression as

$$Y = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{(\mu + \sigma\bar{W}) - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\,\bar{W} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i$$

Let $M_W(t)$ denote the common mgf of the $W_i$'s (since the $X_i$'s are iid, so are the $W_i$'s). We will obtain
the mgf of $Y$ in terms of $M_W(t)$; we then want to show that the mgf of $Y$ converges to the mgf of a
standard normal random variable, $M_Z(t) = e^{t^2/2}$.
From the mgf properties in Section 5.3, we have the following:

$$M_Y(t) = M_{W_1 + \cdots + W_n}(t/\sqrt{n}) = \left[M_W(t/\sqrt{n})\right]^n$$

For the limit, we will use the fact that $M_W(0) = 1$, a basic property of all mgfs. And, critically, because
the $W_i$'s are standardized rvs, $E(W_i) = 0$ and $V(W_i) = 1$, from which we also have $M'_W(0) = E(W) = 0$
and $M''_W(0) = E(W^2) = V(W) + [E(W)]^2 = 1$.

To determine the limit as $n \to \infty$, we take a natural logarithm, make the substitution $x = 1/\sqrt{n}$,
then apply L'Hôpital's Rule twice:

$$\lim_{n \to \infty} \ln[M_Y(t)] = \lim_{n \to \infty} n \ln[M_W(t/\sqrt{n})]$$

$$= \lim_{x \to 0} \frac{\ln[M_W(tx)]}{x^2} \qquad \text{substitute } x = 1/\sqrt{n}$$

$$= \lim_{x \to 0} \frac{M'_W(tx) \cdot t / M_W(tx)}{2x} \qquad \text{L'Hôpital's Rule}$$

$$= \lim_{x \to 0} \frac{t M'_W(tx)}{2x M_W(tx)}$$

$$= \lim_{x \to 0} \frac{t^2 M''_W(tx)}{2M_W(tx) + 2xt M'_W(tx)} \qquad \text{L'Hôpital's Rule}$$

$$= \frac{t^2 M''_W(0)}{2M_W(0) + 2(0)t M'_W(0)} = \frac{t^2(1)}{2(1) + 0} = \frac{t^2}{2}$$

You can verify for yourself that at each use of L'Hôpital's Rule, the preceding fraction had the
indeterminate 0/0 form. Finally, since the logarithm function and its inverse are continuous, we may
conclude that $M_Y(t) \to e^{t^2/2}$, which completes the proof. ■
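The convergence $[M_W(t/\sqrt{n})]^n \to e^{t^2/2}$ can be watched numerically for a specific population. The sketch below assumes an exponential population with mean 1, so that $W = X - 1$ is already standardized and $M_W(s) = e^{-s}/(1 - s)$ for $s < 1$; this choice of distribution and the function names are illustrative assumptions, not part of the proof.

```python
import math

def log_mgf_w(s):
    """ln M_W(s) for W = X - 1, X exponential with mean 1: M_W(s) = e^{-s} / (1 - s), s < 1."""
    return -s - math.log1p(-s)

def mgf_y(t, n):
    """M_Y(t) = [M_W(t / sqrt(n))]^n, which requires t / sqrt(n) < 1."""
    return math.exp(n * log_mgf_w(t / math.sqrt(n)))

# Compare M_Y(1) with the standard normal mgf value e^{1/2} as n grows.
target = math.exp(0.5)
for n in (4, 100, 10_000, 1_000_000):
    print(n, mgf_y(1.0, n), target)
```

The gap between the two printed columns shrinks as $n$ increases, in line with the limit established above.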



Introduction 

Given a parameter of interest, such as a population mean $\mu$ or population proportion $p$, the objective
of point estimation is to use a sample to compute a number that represents, in some sense, a "good
guess" for the true value of the parameter. The resulting number is called a point estimate. In
Section 7.1, we present some general concepts of point estimation. In Section 7.2, we describe and
illustrate two important methods for obtaining point estimates: the method of moments and the
method of maximum likelihood.

Obtaining a point estimate entails calculating the value of a statistic such as the sample mean $\bar{X}$ or
sample proportion $\hat{P}$. We should therefore be concerned that the chosen statistic utilizes all the
relevant information available about the parameter of interest. The idea of "no information loss" is
made precise by the concept of sufficiency, which is developed in Section 7.3. Finally, Section 7.4
further explores the meaning of efficient estimation and properties of maximum likelihood estimators.


7.1 Concepts and Criteria for Point Estimation 


Statistical inference is frequently directed toward drawing some type of conclusion about one or more
parameters (population characteristics). To do so requires that an investigator obtain sample data from
each of the populations under study. Conclusions can then be based on the computed values of
various sample quantities. For example, let $\mu$ (a parameter) denote the average salary of all alumni
from a certain university. A random sample of $n = 250$ alumni might be chosen and the salary for
each one determined, resulting in observed values $x_1, x_2, \ldots, x_{250}$. The sample mean salary $\bar{x}$ could
then be used to draw a conclusion about the value of $\mu$. Similarly, if $\sigma$ is the standard deviation of the
alumni salary distribution (population sd, another parameter), the value of the sample standard
deviation $s$ can be used to infer something about $\sigma$.

Recall from the previous chapter that before data is available, the sample observations are
considered random variables (rvs) $X_1, X_2, \ldots, X_n$. It follows that any function of the $X_i$'s (that is, any
statistic), such as the sample mean $\bar{X}$ or sample standard deviation $S$, is also a random variable. That
is, its value will generally vary from one sample to another, and before a particular sample is selected,
there is uncertainty as to what value the statistic will assume. The same is true if available data
consists of more than one sample. For example, we can represent the salaries of $m$ statistics alumni
and $n$ computer science alumni by $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, respectively. The difference between the
two sample mean salaries is $\bar{X} - \bar{Y}$, the natural statistic for making inferences about $\mu_1 - \mu_2$, the
difference between the population mean salaries.

When discussing general concepts and methods of inference, it is convenient to have a generic
symbol for the parameter of interest. We will use the Greek letter $\theta$ for this purpose.


DEFINITION A point estimate of a parameter $\theta$ is a single number that can be regarded as a
sensible value for $\theta$. A point estimate is obtained by selecting a suitable statistic
and determining its value from the given sample data. The selected statistic is
called the point estimator of $\theta$.


Suppose, for example, that the parameter of interest is $\mu$ = the true average battery life (in hours) for a
certain type of cell phone under continuous use. A random sample of $n = 3$ phones might yield
observed lifetimes $x_1 = 5.0$, $x_2 = 6.4$, $x_3 = 5.9$. The computed value of the sample mean lifetime is
$\bar{x} = 5.77$, and it is reasonable to regard 5.77 h as a plausible value of $\mu$, our "best guess" for the value
of $\mu$ based on the available sample information. The point estimator used was the statistic $\bar{X}$, and the
point estimate of $\mu$ was $\bar{x} = 5.77$. If the three observed lifetimes had instead been $x_1 = 5.6$, $x_2 = 4.5$,
and $x_3 = 6.1$, use of the same estimator $\bar{X}$ would have resulted in a different point estimate,
$\bar{x} = (5.6 + 4.5 + 6.1)/3 = 5.40$ h.
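The distinction between an estimator (a rule) and an estimate (the number the rule produces for a given sample) can be mirrored in a few lines of code. In this small sketch the function plays the role of the estimator $\bar{X}$, and the two return values are the two estimates computed above; the function name is an illustrative choice.

```python
from statistics import fmean

def sample_mean_estimator(sample):
    """The estimator: a rule that maps any sample to the value of the sample mean."""
    return fmean(sample)

# Applying the same estimator to two different samples gives two different estimates.
first_estimate = sample_mean_estimator([5.0, 6.4, 5.9])
second_estimate = sample_mean_estimator([5.6, 4.5, 6.1])
print(round(first_estimate, 2), round(second_estimate, 2))
```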

The symbol $\hat{\theta}$ ("theta hat") is customarily used to denote the point estimate resulting from a given
sample; we shall also use it to denote the estimator, as an uppercase $\hat{\Theta}$ is somewhat awkward to write.
Thus $\hat{\mu} = \bar{X}$ is read as "the point estimator of $\mu$ is the sample mean $\bar{X}$." The statement "the point
estimate of $\mu$ is 5.77 h" can be written concisely as $\hat{\mu} = \bar{x} = 5.77$. Notice that in writing a statement
like $\hat{\theta} = 72.5$, there is no indication of how this point estimate was obtained (i.e., what statistic was
used). We recommend that both the estimator/statistic and the resulting estimate be reported.


Example 7.1 An automobile manufacturer has developed a new type of bumper, which is supposed
to absorb impacts with less damage than previous bumpers. The manufacturer has used this bumper in
a sequence of 25 controlled crashes against a wall, each at 10 mph, using one of its compact car
models. Let $X$ = the number of crashes that result in no visible damage to the automobile (a
"success"). The parameter to be estimated is $p$ = the proportion of all such crashes that result in no visible
damage; equivalently, $p$ = P(no visible damage in a crash). If $X$ is observed to be $x = 15$, the most
reasonable estimator and estimate are

$$\text{estimator} = \hat{P} = \frac{X}{n} \qquad \text{estimate} = \hat{p} = \frac{x}{n} = \frac{15}{25} = .60$$ ■

If for each parameter of interest there were only one reasonable point estimator, there would not be 
much to point estimation. In most problems, though, there will be more than one reasonable 
estimator. 


Example 7.2 Many communities have added fluoride to drinking water since the 1940s, but the
solubility of sodium fluoride in particular is important to many industries. The article "A Review of
Sodium Fluoride Solubility in Water" (J. Chem. Engr. Data 2017: 1743-1748) provides the following
n = 16 values for the solubility of sodium fluoride (millimoles of NaF per kilogram of H2O,
mmol/kg) at 25 °C:

956 974 980 980 982 983 983 985
985 985 987 987 995 999 1000 1007


One goal in the article was to estimate $\mu$ = the true mean solubility of NaF at 25 °C. A dotplot of
the sample data suggests a symmetric measurement distribution, so $\mu$ could also represent the true
median solubility. The given observations are assumed to be the result of a random sample
$X_1, X_2, \ldots, X_{16}$ from this symmetric distribution. Consider the following estimators and resulting
estimates for $\mu$:


a. Estimator = $\bar{X}$ (the sample mean), estimate = $\bar{x} = \sum x_i/n = 15{,}768/16 = 985.5$ mmol/kg

b. Estimator = $\tilde{X}$ (the sample median), estimate = $\tilde{x} = (985 + 985)/2 = 985$ mmol/kg

c. Estimator = $\bar{X}_e = [\min(X_i) + \max(X_i)]/2$ = the midrange (i.e., the average of the two extreme
values), estimate = $[\min(x_i) + \max(x_i)]/2 = (956 + 1007)/2 = 981.5$ mmol/kg

d. Estimator = $\bar{X}_{tr(6.25)}$ = the 6.25% trimmed mean (discard the smallest and largest values of the
sample and then average), estimate = $\bar{x}_{tr(6.25)} = (15{,}768 - 956 - 1007)/14 = 986.1$ mmol/kg.


Each one of the different estimators (a)-(d) uses a different measure of the center of the sample to
estimate $\mu$. Which of the estimates is closest to the true value? This question cannot be answered without
already knowing the true value. However, a question that can be addressed is, "Which estimator, when
used on other samples of $X_i$'s, will tend to produce estimates closest to the true value?" ■
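The four estimates in Example 7.2 can be reproduced with a few lines of code. The following is a minimal sketch in Python using only the standard library; the helper names `midrange` and `trimmed_mean` are illustrative choices, not functions from any particular statistics package.

```python
from statistics import mean, median

# NaF solubility data (mmol/kg) from Example 7.2
solubility = [956, 974, 980, 980, 982, 983, 983, 985,
              985, 985, 987, 987, 995, 999, 1000, 1007]

def midrange(xs):
    """Average of the smallest and largest observations."""
    return (min(xs) + max(xs)) / 2

def trimmed_mean(xs, k=1):
    """Mean after discarding the k smallest and k largest observations."""
    ordered = sorted(xs)
    return mean(ordered[k:len(ordered) - k])

estimates = {
    "mean": mean(solubility),
    "median": median(solubility),
    "midrange": midrange(solubility),
    # k = 1 of n = 16 values removed from each end is 6.25% trimming
    "trimmed mean (6.25%)": trimmed_mean(solubility, k=1),
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

Each function is an estimator (a rule applied to any sample); the printed numbers are the corresponding estimates for this particular sample.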


Example 7.3 Continuing the previous example, suppose we also want to estimate the population
variance $\sigma^2$. A natural estimator is the sample variance:

$$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$$

The corresponding point estimate is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(956 - 985.5)^2 + \cdots + (1007 - 985.5)^2}{16 - 1} = 135.87$$

A point estimate of $\sigma$ would then be $\hat{\sigma} = s = \sqrt{135.87} = 11.66$ mmol/kg.
An alternative estimator would result from using divisor n instead of n − 1 (i.e., the average
squared deviation):

$$\text{estimator} = \frac{\sum (X_i - \bar{X})^2}{n} \qquad \text{estimate} = \frac{(956 - 985.5)^2 + \cdots + (1007 - 985.5)^2}{16} = 127.38$$

We will indicate shortly why many statisticians prefer $S^2$ to the estimator with divisor n. ■
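As a quick check of the two variance estimates in Example 7.3, here is a short sketch using Python's standard `statistics` module, in which `variance` uses divisor n − 1 and `pvariance` uses divisor n. The variable names are illustrative.

```python
from statistics import variance, pvariance, stdev

# NaF solubility data (mmol/kg) from Example 7.2
solubility = [956, 974, 980, 980, 982, 983, 983, 985,
              985, 985, 987, 987, 995, 999, 1000, 1007]

s_squared = variance(solubility)     # divisor n - 1 (sample variance)
avg_sq_dev = pvariance(solubility)   # divisor n (average squared deviation)
s = stdev(solubility)                # square root of the sample variance

print(f"divisor n-1: {s_squared:.2f}")
print(f"divisor n:   {avg_sq_dev:.2f}")
print(f"s:           {s:.2f}")
```

The two variance estimates differ only by the factor (n − 1)/n = 15/16.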


Assessing Estimators: Accuracy and Precision 

When a particular statistic is selected to estimate an unknown parameter, two criteria often used to 
assess the quality of that estimator are its accuracy and its precision. Loosely speaking, an estimator is 
accurate if it has no systematic tendency to overestimate or underestimate the value of the parameter, 
across repeated values of the estimator calculated from different samples. An estimator is precise if 
those same repeated values are typically “close together,” so that two statisticians using the same 
estimator (but two different random samples) are liable to get similar point estimates. The notions of 
accuracy and precision are made more rigorous by the following definitions. 
DEFINITION A point estimator $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E(\hat{\theta}) = \theta$ for every
possible value of $\theta$. The difference $E(\hat{\theta}) - \theta$ is called the bias of $\hat{\theta}$ (and equals 0 if $\hat{\theta}$ is
unbiased).

The standard error of $\hat{\theta}$ is its standard deviation, $\sigma_{\hat{\theta}} = \sqrt{V(\hat{\theta})}$. If the standard
error itself involves unknown parameters whose values can be estimated, substitution
of these estimates into $\sigma_{\hat{\theta}}$ yields the estimated standard error of $\hat{\theta}$. The
estimated standard error can be denoted by either $\hat{\sigma}_{\hat{\theta}}$ or by $s_{\hat{\theta}}$.


Unbiasedness requires that the sampling distribution of the estimator be centered at the value of $\theta$,
whatever that value might be. Thus if $\theta = 50$, the mean value of the estimator must be 50, if $\theta = .25$
the mean value must be .25, and so on. The bias of an estimator $\hat{\theta}$ quantifies its accuracy by
measuring how far, on the average, $\hat{\theta}$ differs from $\theta$. The standard error of $\hat{\theta}$ quantifies its precision by
measuring the variability of $\hat{\theta}$ across different possible realizations (i.e., different random samples).
Intuitively its value describes the "typical" deviation between an estimate and the mean value of the
estimator. It is important to note that both bias and standard error are properties of an estimator (the
random variable), such as $\bar{X}$, and not of any specific value or estimate, $\bar{x}$.

Figure 7.1 illustrates bias and standard error for three potential estimators of a population
parameter $\theta$. Figure 7.1a shows the distribution of an estimator $\hat{\theta}_1$ whose expected value is very close
to $\theta$ but whose distribution is quite dispersed. Hence, $\hat{\theta}_1$ has low bias but relatively high standard
error. In contrast, the distribution of $\hat{\theta}_2$ displayed in Figure 7.1b is very concentrated but is "off
target": the values of $\hat{\theta}_2$ across different random samples will systematically overestimate $\theta$. So, $\hat{\theta}_2$
has low standard error but high bias. The "ideal" estimator is illustrated in Figure 7.1c: $\hat{\theta}_3$ has a mean
roughly equal to $\theta$, so it has low bias, and it also has a relatively small standard error.

Figure 7.1 Three potential types of estimators: (a) accurate, but not precise; (b) precise, but not accurate;
(c) both accurate and precise
It may seem as though it is necessary to know the value of $\theta$ (in which case estimation is
unnecessary!) to decide whether an estimator $\hat{\theta}$ is unbiased. This is not usually the case, though, as
we'll see in the next several examples.


Example 7.4 In Example 7.1, the sample proportion $\hat{P} = X/n$ was used as an estimator of p = the
true proportion of successes in all possible crash tests. Because X, the number of sample successes,
has a Bin(n, p) distribution, the mean of $\hat{P}$ is

$$E(\hat{P}) = E\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(np) = p$$

Thus $\hat{P}$ is unbiased regardless of the value of p and the sample size n. The standard error of the
estimator is

$$\sigma_{\hat{P}} = \sqrt{V(\hat{P})} = \sqrt{V\left(\frac{X}{n}\right)} = \sqrt{\frac{1}{n^2}V(X)} = \sqrt{\frac{1}{n^2}\,np(1-p)} = \sqrt{\frac{p(1-p)}{n}}$$

Since p is unknown (else why estimate?), we could substitute $\hat{p} = x/n$ into $\sigma_{\hat{P}}$, yielding the estimated
standard error $\hat{\sigma}_{\hat{P}} = \sqrt{\hat{p}(1 - \hat{p})/n}$. When n = 25 and $\hat{p} = .6$, this gives $\hat{\sigma}_{\hat{P}} = \sqrt{(.6)(.4)/25} = .098$.
Alternatively, since the largest value of p(1 − p) is attained when p = .5, an upper bound on the
standard error is $\sqrt{(.5)(.5)/n} = 1/(2\sqrt{n})$. Notice that the precision of the estimator $\hat{P}$ improves (i.e.,
its standard error decreases) as the sample size n increases. ■
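The standard error calculations of Example 7.4 can be sketched in a few lines of Python. The function names below are illustrative and are not taken from any library.

```python
from math import sqrt

def estimated_se_proportion(x, n):
    """Estimated standard error of the sample proportion, sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    return sqrt(p_hat * (1 - p_hat) / n)

def se_upper_bound(n):
    """Largest possible standard error of the sample proportion, attained at p = .5."""
    return 1 / (2 * sqrt(n))

# Crash-test data of Example 7.1: x = 15 successes in n = 25 trials
se_hat = estimated_se_proportion(15, 25)
bound = se_upper_bound(25)
print(round(se_hat, 3), bound)
```

For the crash-test data the estimated standard error (about .098) sits just below the upper bound of .1, as expected since $\hat{p} = .6$ is close to .5.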


Example 7.5 In the solubility study of Example 7.2, suppose we use the estimator $\bar{X}$ to estimate $\mu$.
Properties of $\bar{X}$ derived in Chapter 6 include

$$E(\bar{X}) = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ denotes the standard deviation of the population distribution of solubility measurements
(another parameter whose value is unknown). Thus, the sampling distribution of $\bar{X}$ is centered at $\mu$
(i.e., $\bar{X}$ is an unbiased estimator of $\mu$) regardless of its value and the sample size n. As with the sample
proportion, the standard error of the sample mean decreases (that is, its precision improves) with
increasing sample size.

Since the value of $\sigma$ is almost always unknown, we can estimate the standard error of $\bar{X}$ by
$\hat{\sigma}_{\bar{X}} = s/\sqrt{n}$, where s denotes the sample standard deviation. For the 16 observations presented in
Example 7.2, s = 11.66. The estimated standard error is then $s/\sqrt{n} = 11.66/\sqrt{16} = 2.92$. This
quantity indicates that, based on the available data, we believe our estimate of $\mu$, $\bar{x} = 985.5$ mmol/kg,
is liable to differ by about ±2.92 mmol/kg from the actual value of $\mu$. ■
Example 7.6 Suppose that X, the reaction time (s) to a stimulus, has a uniform distribution on the
interval from 0 to an unknown upper limit $\theta$. An investigator wants to estimate $\theta$ on the basis of a
random sample $X_1, X_2, \ldots, X_n$ of reaction times. Since $\theta$ is the largest possible reaction time in the
entire population, consider as a first estimator the largest sample reaction time:
$\hat{\theta}_b = \max(X_1, \ldots, X_n)$. If n = 5 and $x_1 = 1.7$, $x_2 = 4.2$, $x_3 = 2.4$, $x_4 = 3.9$, $x_5 = 1.3$, the point estimate of $\theta$
is $\hat{\theta}_b = \max(1.7, 4.2, 2.4, 3.9, 1.3) = 4.2$ s.

For an unbiased estimator, some samples will yield estimates that exceed $\theta$ and other samples will
yield estimates smaller than $\theta$; otherwise $\theta$ could not possibly be the center of the estimator's
distribution. However, our proposed estimator $\hat{\theta}_b$ will never overestimate $\theta$ (the largest sample value
cannot exceed the largest population value) and will underestimate $\theta$ unless the largest sample value
equals $\theta$. This intuitive argument shows that $\hat{\theta}_b$ is a biased estimator (hence the subscript b). More
precisely, using results on ordered values from a random sample (Section 5.7), it can be shown (see
Exercise 62) that

$$E(\hat{\theta}_b) = \frac{n}{n+1}\cdot\theta < \theta \quad \text{and} \quad V(\hat{\theta}_b) = \frac{n\theta^2}{(n+1)^2(n+2)}$$

The bias of $\hat{\theta}_b$ is given by $E(\hat{\theta}_b) - \theta = n\theta/(n+1) - \theta = -\theta/(n+1)$. Because the bias is negative,
we say that $\hat{\theta}_b$ is biased low, meaning that it systematically underestimates the true value of $\theta$.
Thankfully, the bias approaches 0 as n increases and is negligible for large n. The standard error of $\hat{\theta}_b$
can be estimated by substituting the known value of $\hat{\theta}_b$ for the unknown $\theta$ in the square root of the
variance formula above.


It is easy to modify $\hat{\theta}_b$ to obtain an unbiased estimator of $\theta$. Consider the estimator

$$\hat{\theta}_u = \frac{n+1}{n}\cdot\max(X_1, \ldots, X_n)$$

Using this estimator on the data gives the estimate (6/5)(4.2) = 5.04 s. The fact that (n + 1)/n > 1
implies that $\hat{\theta}_u$ will overestimate $\theta$ for some samples and underestimate it for others. The mean value
of this estimator is

$$E(\hat{\theta}_u) = E\left[\frac{n+1}{n}\cdot\max(X_1, \ldots, X_n)\right] = \frac{n+1}{n}\cdot E(\hat{\theta}_b) = \frac{n+1}{n}\cdot\frac{n}{n+1}\,\theta = \theta$$

Thus, by definition, $\hat{\theta}_u$ is an unbiased estimator of $\theta$. If $\hat{\theta}_u$ is used repeatedly on different samples to
estimate $\theta$, some estimates will be too large and others will be too small, but in the long run there will
be no systematic tendency to underestimate or overestimate $\theta$. ■
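The long-run behavior described in Example 7.6 can be illustrated by simulation. The sketch below is one possible check, not part of the original example: it assumes a true value $\theta = 5$ (chosen arbitrarily for illustration), repeatedly draws samples of size n = 5 from the uniform distribution on [0, $\theta$], and averages the values of the sample maximum and of the rescaled maximum. The function name and seed are hypothetical choices.

```python
import random
from statistics import mean

def simulate_max_estimators(theta, n, reps, seed=0):
    """Average the biased (max) and unbiased ((n+1)/n * max) estimates over many samples."""
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(reps):
        largest = max(rng.uniform(0, theta) for _ in range(n))
        biased.append(largest)                   # theta-hat_b
        unbiased.append((n + 1) / n * largest)   # theta-hat_u
    return mean(biased), mean(unbiased)

# theta = 5.0 is an assumed "true" value used only for this illustration
avg_b, avg_u = simulate_max_estimators(theta=5.0, n=5, reps=20000)
print(f"average of max:          {avg_b:.3f}")
print(f"average of rescaled max: {avg_u:.3f}")
```

With these settings the first average should land near $n\theta/(n+1) = 4.17$ and the second near $\theta = 5$, in line with the expected values derived above.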


Mean Squared Error 
Rather than consider bias and variance (accuracy and precision) separately, another popular way to
quantify the idea of $\hat{\theta}$ being close to $\theta$ is to consider the squared error $(\hat{\theta} - \theta)^2$. For some samples, $\hat{\theta}$
will be quite close to $\theta$ and the resulting squared error will be very small, whereas the squared error
will be quite large whenever a sample produces an estimate $\hat{\theta}$ that is far from the target. An omnibus
measure of quality is the mean squared error (expected squared error), which entails averaging the
squared error over all possible samples and resulting estimates.


DEFINITION The mean squared error (MSE) of an estimator $\hat{\theta}$ is $E[(\hat{\theta} - \theta)^2]$.


For the estimators whose distributions are displayed in Figure 7.1a and b, the mean squared error is
comparatively large, since there are many values of $\hat{\theta}$ in those two distributions that are quite some
distance from $\theta$. On the other hand, the estimator $\hat{\theta}_3$ in Figure 7.1c has much lower MSE. In fact,
mean squared error penalizes an estimator for having either high bias (poor accuracy) or high
variance (poor precision), as indicated by the following proposition.


PROPOSITION For any estimator $\hat{\theta}$ of a parameter $\theta$,

$$\text{MSE} = V(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2 = \text{variance of estimator} + (\text{bias})^2$$

In particular, for any unbiased estimator of $\theta$, its MSE and variance are equal.


The proof of this result is a simple application of the variance shortcut formula and is left as an
exercise (Exercise 23).


Example 7.7 (Example 7.4 continued) Consider once again estimating a population proportion of
"successes" p. We have already established that the sample proportion $\hat{P} = X/n$ is an unbiased
estimator of p with variance equal to p(1 − p)/n. Hence, its mean squared error is

$$E[(\hat{P} - p)^2] = V(\hat{P}) + 0^2 = \frac{p(1-p)}{n}$$

Now consider the alternative estimator $\tilde{P} = (X + 2)/(n + 4)$; that is, add two successes and two
failures to the sample and then calculate the new sample proportion of successes. One intuitive
justification for this estimator is that

$$\left|\hat{P} - .5\right| = \left|\frac{X}{n} - .5\right| = \left|\frac{X - .5n}{n}\right| \quad \text{while} \quad \left|\tilde{P} - .5\right| = \left|\frac{X+2}{n+4} - .5\right| = \left|\frac{X - .5n}{n+4}\right|$$

from which we see that $\tilde{P}$ is always somewhat closer to .5 than is $\hat{P}$. (It seems particularly reasonable
to move the estimate toward .5 when the number of successes in the sample is close to 0 or n. For
example, if there are no successes at all in the sample, is it sensible to estimate the population
proportion of successes as zero, especially if n is small?)


The bias of $\tilde{P}$ is

$$E(\tilde{P}) - p = E\left(\frac{X+2}{n+4}\right) - p = \frac{E(X)+2}{n+4} - p = \frac{np+2}{n+4} - p = \frac{2/n - 4p/n}{1 + 4/n}$$

This bias is not zero unless p = .5. However, as n increases the numerator approaches zero and the
denominator approaches 1, so the bias approaches zero. The variance of $\tilde{P}$ is

$$V(\tilde{P}) = V\left(\frac{X+2}{n+4}\right) = \frac{V(X+2)}{(n+4)^2} = \frac{V(X)}{(n+4)^2} = \frac{np(1-p)}{(n+4)^2} = \frac{p(1-p)}{n + 8 + 16/n}$$

This variance approaches zero as the sample size increases. Finally, the mean squared error of $\tilde{P}$ is

$$\text{MSE} = \frac{p(1-p)}{n + 8 + 16/n} + \left(\frac{2/n - 4p/n}{1 + 4/n}\right)^2$$

So how does the mean squared error of the usual estimator $\hat{P}$ compare to that of the alternative
estimator $\tilde{P}$? If one MSE were smaller than the other for all values of p, then we could say that one
estimator is always preferred to the other (using MSE as our criterion). But as Figure 7.2 shows, this
is not the case, at least for the sample sizes n = 10 and n = 100, and in fact is not true for any other
sample size.


Figure 7.2 Graphs of MSE (vertical axis) against p (horizontal axis) for the usual and alternative
estimators of p: (a) n = 10; (b) n = 100


According to Figure 7.2, the two MSEs are quite different when n is small. In this case the
alternative estimator is better for values of p near .5 (since it moves the sample proportion toward .5)
but not for extreme values of p. For large n, the two MSEs are quite similar, but again neither
dominates the other. ■
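The comparison in Example 7.7 can be checked numerically from the two MSE formulas. The following sketch (function names are illustrative) evaluates both at a few values of p; it uses the algebraically equivalent forms $np(1-p)/(n+4)^2$ for the variance and $(2 - 4p)/(n+4)$ for the bias of the alternative estimator.

```python
def mse_usual(p, n):
    """MSE of X/n: unbiased, so MSE equals the variance p(1 - p)/n."""
    return p * (1 - p) / n

def mse_alternative(p, n):
    """MSE of (X + 2)/(n + 4): variance plus squared bias."""
    variance = n * p * (1 - p) / (n + 4) ** 2
    bias = (2 - 4 * p) / (n + 4)
    return variance + bias ** 2

for n in (10, 100):
    for p in (0.01, 0.05, 0.5, 0.95):
        print(f"n={n:3d} p={p:.2f} "
              f"usual={mse_usual(p, n):.6f} alternative={mse_alternative(p, n):.6f}")
```

At p = .5 the alternative estimator has the smaller MSE for both sample sizes, while near p = 0 or p = 1 the usual estimator wins, so neither dominates.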


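Example 7.8 (Example 7.3 continued) Let's return now to the problem of estimating population
variance $\sigma^2$ based on a random sample $X_1, \ldots, X_n$. First consider the sample variance estimator
$S^2 = \sum (X_i - \bar{X})^2/(n - 1)$. Applying the property $E(Y^2) = V(Y) + [E(Y)]^2$ to the computing formula
$\sum (X_i - \bar{X})^2 = \sum X_i^2 - (1/n)(\sum X_i)^2$ from Section 1.4 gives

$$\begin{aligned}
E\left[\sum (X_i - \bar{X})^2\right] &= E\left[\sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2\right] = \sum E(X_i^2) - \frac{1}{n}E\left[\left(\sum X_i\right)^2\right] \\
&= \sum (\sigma^2 + \mu^2) - \frac{1}{n}\left\{V\left(\sum X_i\right) + \left[E\left(\sum X_i\right)\right]^2\right\} \\
&= n\sigma^2 + n\mu^2 - \frac{1}{n}\left[n\sigma^2 + (n\mu)^2\right] \\
&= n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2 = (n-1)\sigma^2
\end{aligned}$$

$$E[S^2] = E\left[\frac{1}{n-1}\sum (X_i - \bar{X})^2\right] = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2$$

Thus we have shown that the sample variance $S^2$ is an unbiased estimator of $\sigma^2$ for any population
distribution.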
The estimator from Example 7.3 that uses divisor n can be expressed as $(n - 1)S^2/n$, and

$$E\left[\frac{(n-1)S^2}{n}\right] = \frac{n-1}{n}E(S^2) = \frac{n-1}{n}\sigma^2$$

This estimator is therefore biased; in particular, its bias is $(n - 1)\sigma^2/n - \sigma^2 = -\sigma^2/n$. Because the bias is
negative, the estimator with divisor n tends to underestimate $\sigma^2$, and this is why the divisor n − 1 is
preferred by many statisticians (although when n is large, the bias is small and there is little difference
between the two).
This is not quite the whole story, however. Let's now consider all estimators of the form

$$\hat{\sigma}^2 = c\sum (X_i - \bar{X})^2$$

The expected value of such an estimator is

$$E\left[c\sum (X_i - \bar{X})^2\right] = cE\left[\sum (X_i - \bar{X})^2\right] = c(n-1)\sigma^2$$

Clearly the only unbiased estimator of this type is the sample variance, with c = 1/(n − 1). Annoyingly,
the variance of $\hat{\sigma}^2$ depends on the underlying population distribution. So suppose the random
sample has come from a normal distribution. Then from Section 6.4, we know that the rv $(n - 1)S^2/\sigma^2$
has a chi-squared distribution with n − 1 degrees of freedom. The variance of a $\chi^2_{n-1}$ rv is 2(n − 1), so
the variance of the estimator is

$$V\left[c\sum (X_i - \bar{X})^2\right] = V\left[c\sigma^2\cdot\frac{(n-1)S^2}{\sigma^2}\right] = (c\sigma^2)^2\,V\left[\frac{(n-1)S^2}{\sigma^2}\right] = c^2\sigma^4\cdot 2(n-1)$$

Substituting these expressions into the relationship MSE = variance + (bias)$^2$, the value of c for
which MSE is minimized turns out to be c = 1/(n + 1); see Exercise 65. So in this situation, minimizing
the MSE yields a rather unnatural (and never used) estimator.

As a final blow, even though $S^2$ is unbiased for estimating $\sigma^2$, it is not true that the sample standard
deviation S is unbiased for estimating $\sigma$. This is because the square root function is not linear, and the
expected value of the square root is not the square root of the expected value. Why not find an
unbiased estimator for $\sigma$ and use it, rather than S? Unfortunately there is no estimator of $\sigma$ that is
unbiased for all possible population distributions (although in special cases, such as the normal
distribution, an unbiased estimator can be deduced). Thankfully, the bias of S is not serious unless n is
quite small, so we shall generally employ it as an estimator of $\sigma$. ■
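To see how the choice of c in Example 7.8 affects mean squared error under normal sampling, the formulas for variance, $2(n-1)c^2\sigma^4$, and bias, $[c(n-1) - 1]\sigma^2$, can be combined directly. The sketch below works with MSE divided by $\sigma^4$ so that no value of $\sigma$ is needed; the function name is an illustrative choice, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def relative_mse(c, n):
    """MSE of c * sum((X_i - Xbar)^2) divided by sigma^4, assuming normal sampling."""
    variance = 2 * (n - 1) * c ** 2   # c^2 * 2(n - 1), in units of sigma^4
    bias = c * (n - 1) - 1            # c(n - 1) - 1, in units of sigma^2
    return variance + bias ** 2

n = 10
for divisor in (n - 1, n, n + 1):
    c = Fraction(1, divisor)
    print(f"divisor {divisor}: MSE / sigma^4 = {relative_mse(c, n)}")
```

For n = 10 the relative MSE decreases from 2/9 (divisor n − 1) to 19/100 (divisor n) to 2/11 (divisor n + 1), consistent with the minimizing value c = 1/(n + 1).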


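Example 7.9 (Example 7.6 continued) Consider again the two estimators $\hat{\theta}_b$ and $\hat{\theta}_u$ for the population
maximum of a Uniform[0, $\theta$] distribution. Using the formulas presented in Example 7.6, the
mean squared error of $\hat{\theta}_b$ is given by

$$\text{MSE} = \text{variance} + (\text{bias})^2 = \frac{n\theta^2}{(n+1)^2(n+2)} + \left(-\frac{\theta}{n+1}\right)^2 = \frac{2\theta^2}{(n+1)(n+2)}$$

Since $\hat{\theta}_u$ was found to be unbiased, its mean squared error is simply its variance:

$$\text{MSE} = V(\hat{\theta}_u) = V\left(\frac{n+1}{n}\,\hat{\theta}_b\right) = \left(\frac{n+1}{n}\right)^2 V(\hat{\theta}_b) = \left(\frac{n+1}{n}\right)^2\frac{n\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{n(n+2)}$$

Taken together, we find that $\hat{\theta}_u$ has less bias than $\hat{\theta}_b$ (obviously) but a larger variance. The use of
mean squared error combines these two considerations, and for n > 1 it can be shown that $\hat{\theta}_u$ has a
smaller MSE than $\hat{\theta}_b$ and is therefore the preferred estimator. ■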


Unbiased Estimation 
Finding an estimator whose mean squared error is smaller than that of every other estimator for all 
values of the parameter is sometimes not feasible. One common approach is to restrict the class of 
estimators under consideration in some way, and then seek the estimator that is best in that restricted 
class. 

Statistical practitioners who buy into the Principle of Unbiased Estimation would employ an
unbiased estimator in preference to a biased estimator, even if the latter has a smaller MSE. On this
basis, the sample proportion of successes should be preferred to the alternative estimator of p in
Example 7.7, and the unbiased estimator $\hat{\theta}_u$ should be preferred to the biased estimator $\hat{\theta}_b$ in Example
7.9 (minimizing MSE would lead us to the same estimator in that instance).

In Example 7.2, we proposed several different estimators for the mean $\mu$ of a symmetric distribution.
If there were a unique unbiased estimator for $\mu$, the estimation dilemma could be resolved by
using that estimator. Unfortunately, this is not the case.


PROPOSITION If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with mean $\mu$, then $\bar{X}$ is
an unbiased estimator of $\mu$. If in addition the distribution is continuous and
symmetric, then $\tilde{X}$ and any trimmed mean are also unbiased estimators of $\mu$.


The fact that $E(\bar{X}) = \mu$, so $\bar{X}$ is an unbiased estimator of $\mu$, was established previously. The
unbiasedness of the other estimators is more difficult to verify; the argument requires invoking results on
distributions of ordered values from Section 5.7.

According to the preceding proposition, the Principle of Unbiased Estimation by itself does not
always allow us to select a single estimator. When the underlying population is normal, even the third
estimator in Example 7.2 is unbiased, and there are many other unbiased estimators. If two or more
estimators of a parameter are unbiased, then naturally one selects the estimator among them with the
smallest standard error (equivalently, the least variance). The resulting $\hat{\theta}$ is called the minimum
variance unbiased estimator (MVUE) of $\theta$.


Example 7.10 (Example 7.9 continued) We showed in Example 7.6 that when $X_1, \ldots, X_n$ is a
random sample from a uniform distribution on [0, $\theta$], the estimator $\hat{\theta}_u = [(n + 1)/n]\cdot\max(X_1, \ldots, X_n)$
is unbiased for $\theta$. However, this is not the only unbiased estimator of $\theta$. The expected value of a
uniformly distributed rv is just the midpoint of the support, so here $E(X_i) = \theta/2$. This implies that
$E(\bar{X}) = \theta/2$, from which $E(2\bar{X}) = \theta$. That is, the estimator $\hat{\theta}_2 = 2\bar{X}$ is also unbiased for $\theta$.

If X is uniformly distributed on the interval [0, $\theta$], then from Chapter 4 we have $V(X) = \sigma^2 =
(\theta - 0)^2/12 = \theta^2/12$. Hence, the variance (and MSE) of $\hat{\theta}_2$ are

$$V(\hat{\theta}_2) = V(2\bar{X}) = 4V(\bar{X}) = 4\cdot\frac{\sigma^2}{n} = 4\cdot\frac{\theta^2/12}{n} = \frac{\theta^2}{3n}$$

For n > 1, $V(\hat{\theta}_2)$ will be greater than $V(\hat{\theta}_u)$, so $\hat{\theta}_u$ is a better estimator than $\hat{\theta}_2$. More advanced
methods can be used to show that $\hat{\theta}_u$ is the MVUE of $\theta$; that is, every other unbiased estimator of
$\theta$ has variance that exceeds the variance of $\hat{\theta}_u$. ■
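The variance comparison in Example 7.10 follows from two closed-form expressions: $\theta^2/[n(n+2)]$ for the rescaled maximum (Example 7.9) and $\theta^2/(3n)$ for twice the sample mean. A small sketch, with illustrative function names and variances expressed as multiples of $\theta^2$, makes the comparison concrete for several sample sizes.

```python
from fractions import Fraction

def var_rescaled_max(n):
    """Variance of (n+1)/n * max(X_i), as a multiple of theta^2."""
    return Fraction(1, n * (n + 2))

def var_twice_mean(n):
    """Variance of 2 * Xbar, as a multiple of theta^2."""
    return Fraction(1, 3 * n)

for n in (1, 2, 5, 20):
    ratio = var_twice_mean(n) / var_rescaled_max(n)
    print(f"n={n:2d}: rescaled max {var_rescaled_max(n)}, "
          f"twice mean {var_twice_mean(n)}, ratio {ratio}")
```

The ratio of the two variances is (n + 2)/3, which equals 1 when n = 1 and grows linearly with n, so the advantage of the rescaled maximum becomes more pronounced in larger samples.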


One of the triumphs of mathematical statistics has been the development of methodology for identifying
the MVUE in a wide variety of situations. The most important result of this type for our
purposes concerns estimating the mean $\mu$ of a normal distribution.


THEOREM Let $X_1, \ldots, X_n$ be a random sample from a normal distribution with parameters $\mu$ and $\sigma$.
Then the estimator $\hat{\mu} = \bar{X}$ is the MVUE for $\mu$.


Whenever we are convinced that the population being sampled is normal, the result says that $\bar{X}$ should
be used to estimate $\mu$. For a proof in the special case that $\sigma$ is known, see Exercise 55.

Again, in some situations it is possible to obtain an estimator with small bias that would be
preferred to the best unbiased estimator. This is illustrated in Figure 7.3. However, MVUEs are often
easier to obtain than the type of biased estimator whose distribution is pictured.


Figure 7.3 A biased estimator that is preferable to the MVUE (curves shown: pdf of $\hat{\theta}_1$, a biased
estimator; pdf of $\hat{\theta}_2$, the MVUE)


Consistency 

As a researcher's sample size increases and thus more of the population is observed, any reasonable
estimator should, in some sense, "converge to" the parameter it is estimating. For instance, the Law of
Large Numbers (Section 6.2) states that the sample mean $\bar{X}$ of a random sample converges to the
theoretical mean $\mu$ in a specific mathematical sense as $n \to \infty$. This intuitive notion is called
consistency.

DEFINITION Let $X_1, \ldots, X_n$ be a random sample from a distribution that depends on a parameter $\theta$.
Then an estimator $\hat{\theta}$ is a consistent estimator of $\theta$ if $\hat{\theta}$ converges to $\theta$ as $n \to \infty$,
either in the sense that

1. $E[(\hat{\theta} - \theta)^2] \to 0$ as $n \to \infty$, or
2. $P(|\hat{\theta} - \theta| \geq \varepsilon) \to 0$ as $n \to \infty$ for every $\varepsilon > 0$.¹


Statement 1 in the definition requires that the mean squared error of $\hat{\theta}$ converge to 0 as the sample size
increases to infinity; this is known formally as convergence in mean square or convergence in
quadratic mean. Statement 2 says that $\hat{\theta}$ converges to $\theta$ in probability. Intuitively, this means that the
chance of an estimate differing from the value of $\theta$ by any small amount approaches 0 as the sample
size increases. Other examples of consistent estimators include the sample proportion $\hat{P}$ as an estimator
of a population proportion p and sample standard deviation S as an estimator of population
standard deviation $\sigma$. All estimators introduced in subsequent chapters are consistent.


Some Complications 
Although it was stated previously that $\bar{X}$ is the MVUE for a population mean when sampling from a
normal distribution, that does not mean $\bar{X}$ should be used irrespective of the distribution being sampled.


Example 7.11 Suppose we wish to estimate the number of calories $\theta$ in a certain food. Using standard
measurement techniques, we will obtain a random sample $X_1, \ldots, X_n$ of n calorie measurements.
Imagine that the population distribution is a member of one of the following three families:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\theta)^2/(2\sigma^2)} \qquad -\infty < x < \infty \tag{7.1}$$

$$f(x) = \frac{1}{\pi[1 + (x-\theta)^2]} \qquad -\infty < x < \infty \tag{7.2}$$

$$f(x) = \frac{1}{2c} \qquad \theta - c \leq x \leq \theta + c \tag{7.3}$$

The pdf (7.1) is the normal distribution, (7.2) is called the Cauchy distribution, and (7.3) is a uniform
distribution. All three distributions are symmetric about $\theta$, which is therefore the median of each
distribution. (The value $\theta$ is also the mean for the normal and uniform distributions, but the mean of
the Cauchy distribution fails to exist.)

Consider the four estimators proposed in Example 7.2: $\bar{X}$, $\tilde{X}$, $\bar{X}_e$ (the average of the two extreme
observations), and $\bar{X}_{tr}$ (a trimmed mean). The best estimator for $\theta$ depends crucially on which
distribution is being sampled. In particular,


1. If the random sample comes from a normal distribution, then $\bar{X}$ is the best of the four estimators,
since it has minimum variance among all unbiased estimators.

2. If the random sample comes from a Cauchy distribution, then $\bar{X}$ and $\bar{X}_e$ are terrible estimators
for $\theta$, whereas $\tilde{X}$ is quite good (the MVUE is not known). $\bar{X}$ and $\bar{X}_e$ are bad because they are
very sensitive to outlying observations, and the heavy tails of the Cauchy distribution make a
few such observations likely to appear in any sample.

3. If the underlying distribution is the particular uniform distribution in (7.3), then the best
estimator is $\bar{X}_e$; in general, this estimator is greatly influenced by outlying observations, but
here the lack of tails makes such observations impossible.

4. The trimmed mean is best in none of these three situations but works reasonably well in all
three. That is, $\bar{X}_{tr}$ does not suffer too much in comparison with the best procedure in any of the
three situations.


¹ In fact, Statement 1 in the definition of a consistent estimator implies Statement 2. But there exist
unusual cases for which Statement 1 fails (typically, when the variance of the estimator is infinite)
and Statement 2 still holds.


More generally, research over the past several decades has established that when estimating a point 
of symmetry of a continuous probability distribution, a trimmed mean with trimming proportion 
between 10 and 20% from each end of the sample produces reasonably behaved estimates over a very 
wide range of possible population models. For this reason, a trimmed mean with small trimming 
percentage is said to be a robust estimator. ■


Example 7.12 Suppose a type of component has a lifetime distribution that is exponential with
parameter $\lambda$, so that expected lifetime is $\mu = 1/\lambda$. A sample of n such components is selected, and
each is put into operation. If the experiment is continued until all n lifetimes $X_1, \ldots, X_n$ have been
observed, then $\bar{X}$ is an unbiased estimator of $\mu$.

In some experiments, though, the components are left in operation only until the time of the rth
failure, where r < n. This procedure is referred to as censoring. Let $Y_1$ denote the time of the first
failure (the minimum lifetime among the n components), $Y_2$ denote the time at which the second
failure occurs (the second smallest lifetime), and so on. Since the experiment terminates at time $Y_r$, the
total accumulated lifetime at termination is

$$T_r = \sum_{i=1}^{r} Y_i + (n - r)Y_r$$


We now demonstrate that $\hat{\mu} = T_r/r$ is an unbiased estimator for $\mu$. To do so, we need two properties
of exponential variables:


1. The memoryless property (see Section 4.4) says that at any time point, remaining lifetime has
the same exponential distribution as original lifetime.
2. If $X_1, \ldots, X_k$ are independent exponential rvs with parameter $\lambda$, then $\min(X_1, \ldots, X_k)$ is
exponential with parameter $k\lambda$ and has expected value $1/(k\lambda)$. See Example 5.39.

Since all n components last until $Y_1$, n − 1 last an additional $Y_2 - Y_1$, n − 2 an additional $Y_3 - Y_2$
amount of time, and so on, another expression for $T_r$ is

$$T_r = nY_1 + (n-1)(Y_2 - Y_1) + (n-2)(Y_3 - Y_2) + \cdots + (n-r+1)(Y_r - Y_{r-1})$$

But $Y_1$ is the minimum of n exponential variables, so $E(Y_1) = 1/(n\lambda)$. Similarly, $Y_2 - Y_1$ is the smallest
of the n − 1 remaining lifetimes, each exponential with parameter $\lambda$ (by the memoryless property), so
$E(Y_2 - Y_1) = 1/[(n-1)\lambda]$. Continuing, $E(Y_{i+1} - Y_i) = 1/[(n-i)\lambda]$, so

$$\begin{aligned}
E(T_r) &= nE(Y_1) + (n-1)E(Y_2 - Y_1) + \cdots + (n-r+1)E(Y_r - Y_{r-1}) \\
&= n\cdot\frac{1}{n\lambda} + (n-1)\cdot\frac{1}{(n-1)\lambda} + \cdots + (n-r+1)\cdot\frac{1}{(n-r+1)\lambda} = \frac{r}{\lambda}
\end{aligned}$$

Therefore, $E(T_r/r) = (1/r)E(T_r) = (1/r)\cdot(r/\lambda) = 1/\lambda = \mu$, as claimed.
As an example, suppose 20 components are tested and r = 10. Then if the first ten failure times are
11, 15, 29, 33, 35, 40, 47, 55, 58, and 72, the point estimate of $\mu$ is

$$\hat{\mu} = \frac{11 + 15 + \cdots + 72 + (10)(72)}{10} = 111.5$$

The advantage of the experiment with censoring is that it terminates more quickly than the uncensored
experiment. However, it can be shown that $V(T_r/r) = 1/(\lambda^2 r)$, which is larger than $1/(\lambda^2 n)$, the
variance of $\bar{X}$ in the uncensored experiment. ■
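The censored-data estimate in Example 7.12 can be computed directly from the definition of $T_r$. A minimal sketch follows; the function name `censored_mean_estimate` is an illustrative choice.

```python
def censored_mean_estimate(failure_times, n):
    """Return T_r / r, where T_r is the total accumulated lifetime at the r-th failure.

    failure_times: the r observed failure times out of n components on test.
    """
    ordered = sorted(failure_times)
    r = len(ordered)
    # The r failed units contribute their lifetimes; the n - r survivors each
    # contribute the time of the last observed failure, Y_r.
    total_time = sum(ordered) + (n - r) * ordered[-1]
    return total_time / r

failures = [11, 15, 29, 33, 35, 40, 47, 55, 58, 72]
mu_hat = censored_mean_estimate(failures, n=20)
print(mu_hat)  # 111.5
```

Note that simply averaging the ten observed failure times would give 39.5, far below 111.5, because it ignores the ten components still operating when the experiment stopped.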


The form of an estimator $\hat{\theta}$ may be sufficiently complicated so that standard statistical theory
cannot be applied to obtain a formula for its standard error. If we assume the population has a certain
distribution $f(x; \theta)$, then we can use software to simulate repeated samples from that distribution,
calculate the value of $\hat{\theta}$ for each sample, and use the standard deviation of these various $\hat{\theta}$ values to
estimate $\sigma_{\hat{\theta}}$. Of course, software packages cannot perform such a simulation without the user specifying
a numerical value of $\theta$ in advance, and the value of $\theta$ is unknown for our data. In many
simulation studies, the researcher will therefore perform this process for a variety of $\theta$ values, each
one returning a different estimated standard error of $\hat{\theta}$.


On other occasions, sample data is available from which a point estimate $\hat{\theta}$ has been obtained, so
we have an estimate of $\theta$ but no measure of the uncertainty in that estimate. In that scenario, repeated
values from the pdf $f(x; \hat{\theta})$ (that is, the pdf specified by plugging in $\theta = \hat{\theta}$) are simulated, and the
estimated standard error is obtained as before. This procedure is known as the parametric bootstrap;
we will consider bootstrap methods in greater depth in subsequent chapters.
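The parametric bootstrap just described can be sketched in a few lines. The example below is only an illustration under stated assumptions: it reuses the uniform[0, $\theta$] model and the five reaction times of Example 7.6, takes $\hat{\theta}_u = [(n+1)/n]\cdot\max(x_i)$ as the point estimate, and then simulates samples from the fitted model. The function names, the number of replications, and the seed are hypothetical choices, not prescriptions from the text.

```python
import random
from statistics import stdev

def parametric_bootstrap_se(estimator, sampler, theta_hat, n, reps=2000, seed=0):
    """Estimate the standard error of `estimator` by simulating from the fitted model.

    sampler(rng, theta, n) must return a simulated sample of size n from f(x; theta).
    """
    rng = random.Random(seed)
    replicates = [estimator(sampler(rng, theta_hat, n)) for _ in range(reps)]
    return stdev(replicates)

def uniform_sampler(rng, theta, n):
    """Simulate n observations from the uniform distribution on [0, theta]."""
    return [rng.uniform(0, theta) for _ in range(n)]

def rescaled_max(sample):
    """The unbiased estimator (n + 1)/n * max(X_i) from Example 7.6."""
    n = len(sample)
    return (n + 1) / n * max(sample)

data = [1.7, 4.2, 2.4, 3.9, 1.3]   # reaction times from Example 7.6
theta_hat = rescaled_max(data)     # point estimate, 5.04 s
se_boot = parametric_bootstrap_se(rescaled_max, uniform_sampler, theta_hat, n=len(data))
print(f"theta_hat = {theta_hat:.2f}, bootstrap SE = {se_boot:.2f}")
```

In this particular model a formula is available, so the simulation can be checked against it: substituting $\hat{\theta}_u = 5.04$ into $\theta/\sqrt{n(n+2)}$ gives roughly .85, and the bootstrap value should be close to that. The same code applies unchanged to estimators for which no formula exists, by swapping in a different `estimator` and `sampler`.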


Exercises: Section 7.1 (1-24) 


1. The accompanying data on IQ for first graders at a university laboratory school was introduced
in Exercise 81 of Chapter 1.

82 96 99 102 103 103 106 107 108 108 108
108 109 110 110 111 113 113 113 113 115 115
118 118 119 121 122 122 127 132 136 140 146

a. Calculate a point estimate of the mean value of IQ for the conceptual population of all
first graders in this school, and state which estimator you used. [Hint: $\sum x_i = 3753$.]
b. Calculate a point estimate of the IQ value that separates the lowest 50% of all such
students from the highest 50%, and state which estimator you used.
c. Calculate and interpret a point estimate of the population standard deviation $\sigma$.
Which estimator did you use? [Hint: $\sum x_i^2 = 432{,}015$.]
d. Calculate a point estimate of the proportion of all such students whose IQ exceeds
100. [Hint: Think of an observation as a "success" if it exceeds 100.]
e. Calculate a point estimate of the population coefficient of variation $\sigma/\mu$, and state
which estimator you used.

2. A sample of 20 students who had recently taken introductory statistics yielded the following
information on brand of calculator owned (T = Texas Instruments, H = Hewlett-Packard,
C = Casio, S = Sharp):

[data values not legible in the source]

a. Estimate the true proportion of all such students who own a Texas Instruments
calculator.
b. Of the ten students who owned a TI calculator, 4 had graphing calculators. Estimate
the proportion of students who do not own a TI graphing calculator.
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3. Consider the following sample of observa- 
tions on coating thickness for low-viscosity 
paint (“Achieving a Target Value for a 
Manufacturing Process: A Case Study,” 


J. 


83 
1.48 


Qual. Technol. 1992: 22-26): 


88 88 1.04 1.09 1.12 1.29 1.31 
149 159 1.62 1.65 1.71 1.76 = 1.83 


Assume that the distribution of coating 
thickness is normal (a normal probability 
plot strongly supports this assumption). 


a. 


Calculate a point estimate of the mean 
value of coating thickness, and state which 
estimator you used. 

. What is the estimated standard error of the 
estimator that you used in part (a)? 
Calculate a point estimate of the median of 
the coating thickness distribution, and 
state which estimator you used. 

Calculate a point estimate of the value that 
separates the largest 10% of all values in 
the thickness distribution from the 
remaining 90%, and state which estimator 
you used. [Hint: Express what you are 
trying to estimate in terms of yw and o.] 
Estimate P(X < 1.5), i.e., the proportion of 
all thickness values less than 1.5. [Hint: If 
you knew the values of jz and o, you could 
calculate this probability. These values are 
not available, but they can be estimated.] 


4. The data set mentioned in Exercise 1 also includes these third grade verbal IQ observations for males and females, respectively.

Males:
117 103 121 112 120 132 113 117 132
149 125 131 136 107 108 113 136 114

Females:
114 102 113 131 124 117 120 90
114 109 102 114 127 127 103

Prior to obtaining data, denote the male values by $X_1, \ldots, X_m$ and the female values by $Y_1, \ldots, Y_n$. Suppose that the $X_i$'s constitute a random sample from a distribution with mean $\mu_1$ and standard deviation $\sigma_1$ and that the $Y_i$'s form a random sample (independent of the $X_i$'s) from another distribution with mean $\mu_2$ and standard deviation $\sigma_2$.

a. Use rules of expected value to show that $\bar{X} - \bar{Y}$ is an unbiased estimator of $\mu_1 - \mu_2$. Calculate the estimate for the given data.

b. Use rules of variance from Chapter 5 to obtain expressions for the variance and standard deviation (standard error) of the estimator in part (a), and then compute the estimated standard error.

c. Calculate a point estimate of the ratio $\sigma_1/\sigma_2$ of the two standard deviations.

d. Suppose one male third grader and one female third grader are randomly selected. Calculate a point estimate of the standard deviation of the difference $X - Y$ between male and female IQ.


5. As an example of a situation in which several different statistics could reasonably be used to calculate a point estimate, consider a population of N invoices. Associated with each invoice is its “book value,” the recorded amount of that invoice. Let T denote the total book value, a known amount. Some of these book values are erroneous. An audit will be carried out by randomly selecting n invoices and determining the audited (correct) value for each one. Suppose that the sample gives the following results (in dollars).

Invoice           1     2     3     4     5
Book value        300   720   526   200   127
Audited value     300   520   526   200   157
Error             0     200   0     0     −30

Let $\bar{X}$ = the sample mean audited value, $\bar{Y}$ = the sample mean book value, and $\bar{D}$ = the sample mean error. Propose three different statistics for estimating the total audited (i.e., correct) value $\theta$: one involving just N and $\bar{X}$, another involving N, T, and $\bar{D}$, and the last involving T and $\bar{X}/\bar{Y}$. Then calculate the resulting estimates when N = 5000 and T = 1,761,300. [The article “Statistical Models and Analysis in Auditing,” Stat. Sci. 1989: 2-33 discusses properties of these estimators.]


6. Consider the accompanying data on cycles to failure for a sample of 12 turbine blades (“Effect of Aluminized Coating on Combined Low and High Cycle Fatigue Life of Turbine Blade at Elevated Temperature,” J. Engr. Gas Turbines Power 2019):

209 226 281 494 568 953
488 655 943 973 1193 1358

The article’s authors used an appropriate probability plot to support the use of a lognormal distribution (see Section 4.5) as a model for cycles to failure.

a. Estimate the parameters of the distribution. [Hint: Remember that X has a lognormal distribution with parameters $\mu$ and $\sigma$ if $\ln(X)$ is normally distributed with mean $\mu$ and standard deviation $\sigma$.]

b. Use the estimates of part (a) to estimate the true mean cycles to failure for this type of turbine blade. [Hint: What is E(X) for the lognormal distribution?]


7. a. A random sample of 10 houses in a particular area, each of which is heated with natural gas, is selected and the amount of gas (therms) used during the month of January is determined for each house. The resulting observations are 103, 156, 118, 89, 125, 147, 122, 109, 138, 99. Let $\mu$ denote the average gas usage during January by all houses in this area. Compute a point estimate of $\mu$.

b. Suppose there are 10,000 houses in this area that use natural gas for heating. Let $\tau$ denote the total amount of gas used by all of these houses during January. Estimate $\tau$ using the data of part (a). What estimator did you use in computing your estimate?

c. Use the data in part (a) to estimate p, the proportion of all houses that used at least 100 therms.

d. Give a point estimate of the population median usage (the middle value in the population of all houses) based on the sample of part (a). What estimator did you use?


8. In a random sample of 80 components of a certain type, 12 are found to be defective.

a. Give a point estimate of the proportion of all such components that are not defective.

b. A system is to be constructed by randomly selecting two of these components and connecting them in series, as shown here.

The series connection implies that the system will function if and only if neither component is defective (i.e., both components work properly). Estimate the proportion of all such systems that work properly. [Hint: If p denotes the probability that a component works properly, express P(system works) in terms of p.]

c. Let $\hat{P}$ be the sample proportion of successes. Is $\hat{P}^2$ an unbiased estimator for $p^2$? [Hint: Recall that for any rv Y, $E(Y^2) = V(Y) + [E(Y)]^2$.]


9. Each of 150 newly manufactured items is examined and the number of scratches per item is recorded (the items are supposed to be free of scratches), yielding the following data:

Number of scratches per item   0    1    2    3    4    5    6    7
Observed frequency             18   37   42   30   13   7    2    1

Let X = the number of scratches on a randomly chosen item, and assume that X has a Poisson distribution with parameter $\mu$.

a. Find an unbiased estimator of $\mu$ and compute the estimate for the data.

b. What is the standard deviation (standard error) of your estimator? Compute the estimated standard error. [Hint: $\sigma_X^2 = \mu$ when X is Poisson.]


10. Using a long rod that has length $\mu$, you are going to lay out a square plot in which the length of each side is $\mu$. Thus the area of the plot will be $\mu^2$. However, you do not know the value of $\mu$, so you decide to make n independent measurements $X_1, X_2, \ldots, X_n$ of the length. Assume that each $X_i$ has mean $\mu$ (unbiased measurements) and variance $\sigma^2$.

a. Show that $\bar{X}^2$ is not an unbiased estimator for $\mu^2$. [Hint: For any rv Y, $E(Y^2) = V(Y) + [E(Y)]^2$. Apply this with $Y = \bar{X}$.]

b. For what value of k is the estimator $\bar{X}^2 - kS^2$ unbiased for $\mu^2$? [Hint: Compute $E(\bar{X}^2 - kS^2)$.]


11. Let $X_1$ ($X_2$) denote the number of male (female) teenagers in a random sample of size $n_1$ ($n_2$) who have vaped during the previous 12 months. Denote the probabilities that a randomly selected teenage male and female vaped in the last 12 months by $p_1$ and $p_2$, respectively. Define $\hat{P}_i = X_i/n_i$ for i = 1, 2.

a. Show that $\hat{P}_1 - \hat{P}_2$ is an unbiased estimator for $p_1 - p_2$. [Hint: $E(X_i) = n_i p_i$ for i = 1, 2.]

b. What is the standard error of the estimator in part (a)?

c. How would you use the observed values $x_1$ and $x_2$ to estimate the standard error of your estimator?

d. If $n_1 = n_2 = 200$, $x_1 = 107$, and $x_2 = 62$, use the estimator of part (a) to obtain an estimate of $p_1 - p_2$.

e. Use the result of part (c) and the data of part (d) to estimate the standard error of the estimator.


12. Suppose a certain type of fertilizer has an expected yield per acre of $\mu_1$, with variance $\sigma^2$, whereas the expected yield for a second type of fertilizer is $\mu_2$ with the same variance $\sigma^2$. Let $S_1^2$ and $S_2^2$ denote the sample variances of yields based on sample sizes $n_1$ and $n_2$, respectively, of the two fertilizers. Show that the following pooled (combined) estimator is unbiased for estimating $\sigma^2$:

$$\hat{\sigma}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$


13. The time a customer spends in service after waiting in a queue is often modeled with an exponential distribution. Let $X_1, \ldots, X_n$ be a random sample of service times. Since the parameter $\lambda$ of the exponential distribution is the reciprocal of the expected value, a reasonable estimator of $\lambda$ is $\hat{\lambda} = 1/\bar{X}$.

a. Show using a moment generating function argument that $\bar{X}$ has a gamma distribution, with parameters $\alpha = n$ and $\beta = 1/(n\lambda)$.

b. Show that the mean and variance of the estimator $\hat{\lambda}$ are

$$E(\hat{\lambda}) = \frac{n\lambda}{n-1} \quad \text{and} \quad V(\hat{\lambda}) = \frac{n^2\lambda^2}{(n-1)^2(n-2)}$$

[Hint: Determine $E(1/Y)$ and $E(1/Y^2)$ when Y has a gamma distribution, using the gamma pdf and Expression (4.5). For the variance, use the variance shortcut formula.]

c. Propose a formula for the estimated standard error of $\hat{\lambda}$.


14. Refer back to the previous exercise. Consider the following alternative estimator of the parameter $\lambda$:

$$\hat{\lambda}_u = \frac{n-1}{\sum X_i} = \frac{n-1}{n} \cdot \frac{1}{\bar{X}} = \frac{n-1}{n}\hat{\lambda}$$

a. Determine the mean, variance, and MSE of $\hat{\lambda}_u$. [Hint: Use rescaling properties.]

b. Which of the two estimators, $\hat{\lambda}$ or $\hat{\lambda}_u$, is preferable? Explain your reasoning.


15. Consider a random sample $X_1, \ldots, X_n$ from the pdf

$$f(x; \theta) = .5(1 + \theta x) \qquad -1 \le x \le 1$$

for some $-1 \le \theta \le 1$ (this distribution arises in particle physics). Show that $\hat{\theta} = 3\bar{X}$ is an unbiased estimator of $\theta$. [Hint: First determine $\mu = E(X) = E(\bar{X})$.]


16. A sample of n captured jet fighters results in serial numbers $x_1, x_2, x_3, \ldots, x_n$. The CIA knows that the aircraft were numbered consecutively at the factory starting with $\alpha$ and ending with $\beta$, so that the total number of planes manufactured is $\beta - \alpha + 1$ (e.g., if $\alpha = 17$ and $\beta = 29$, then 29 − 17 + 1 = 13 planes having serial numbers 17, 18, 19, ..., 28, 29 were manufactured). However, the CIA does not know the values of $\alpha$ or $\beta$. A CIA statistician suggests using the estimator $\max(X_i) - \min(X_i) + 1$ to estimate the total number of planes manufactured.

a. If n = 5, $x_1 = 237$, $x_2 = 375$, $x_3 = 202$, $x_4 = 525$, and $x_5 = 418$, what is the corresponding estimate?

b. Under what conditions on the sample will the value of the estimate be exactly equal to the true total number of planes? Will the estimate ever be larger than the true total? Do you think the estimator is unbiased for estimating $\beta - \alpha + 1$? Explain in one or two sentences.

(A similar method was used to estimate German tank production in World War II.)


17. Let $X_1, X_2, \ldots, X_n$ represent a random sample from a Rayleigh distribution with pdf

$$f(x; \theta) = \frac{x}{\theta} e^{-x^2/(2\theta)} \qquad x > 0$$

a. It can be shown that $E(X^2) = 2\theta$. Use this fact to construct an unbiased estimator of $\theta$ based on $\sum X_i^2$ (and use rules of expected value to show that it is unbiased).

b. Estimate $\theta$ from the following measurements of blood plasma beta concentration (in pmol/L) for n = 10 men, assuming the population of measurements follows a Rayleigh distribution.

16.88 10.23 4.59 6.66 13.68
14.23 19.87 9.40 6.51 10.95


18. Suppose the true average growth $\mu$ of one type of plant during a 1-year period is identical to that of a second type, but the variance of growth for the first type is $\sigma^2$, whereas for the second type, the variance is $4\sigma^2$. Let $X_1, \ldots, X_m$ be m independent growth observations on the first type [so $E(X_i) = \mu$, $V(X_i) = \sigma^2$], and let $Y_1, \ldots, Y_n$ be n independent observations on the second type [$E(Y_i) = \mu$, $V(Y_i) = 4\sigma^2$]. Let c be a numerical constant and consider the estimator $\hat{\mu} = c\bar{X} + (1 - c)\bar{Y}$; for any c between 0 and 1, this is a weighted average of the two sample means.

a. Show that for any c the estimator is unbiased.

b. For fixed m and n, what value c minimizes $V(\hat{\mu})$? [Hint: The estimator is a linear combination of the two sample means and these means are independent. Once you have an expression for the variance, differentiate with respect to c.]
19. In Chapter 3, we defined a negative binomial rv as the number of trials required to achieve the rth success in a sequence of independent and identical success/failure trials. The probability mass function (pmf) of X is

$$nb(x; r, p) = \binom{x - 1}{r - 1} p^r (1 - p)^{x - r} \qquad x = r, r + 1, r + 2, \ldots$$

a. Suppose that $r \ge 2$. Show that

$$\hat{P} = (r - 1)/(X - 1)$$

is an unbiased estimator for p. [Hint: Write out $E(\hat{P})$ as a sum, then make the substitutions y = x − 1 and s = r − 1.]

b. A reporter wishing to interview five individuals who support a certain candidate begins asking people whether (S) or not (F) they support the candidate. If the sequence of responses is SFFSFFFSSS, estimate p = the true proportion who support the candidate.


20. Let $X_1, \ldots, X_n$ be a random sample from a pdf f(x) that is symmetric about $\mu$, so that the sample median $\tilde{X}$ is an unbiased estimator of $\mu$. If n is large, it can be shown that $V(\tilde{X}) \approx 1/\{4n[f(\mu)]^2\}$. When the underlying pdf is Cauchy (see Example 7.11), $V(\bar{X}) = \infty$, so $\bar{X}$ is a terrible estimator. What is $V(\tilde{X})$ in this case when n is large?


21. An investigator wishes to estimate the proportion of students at a certain university who have violated the honor code. Having obtained a random sample of n students, she realizes that asking each, “Have you violated the honor code?” will probably result in some untruthful responses. Consider the following scheme, called a randomized response technique. The investigator makes up a deck of 100 cards, of which 50 are of type I and 50 are of type II.

Type I: Have you violated the honor code (yes or no)?

Type II: Is the last digit of your telephone number a 0, 1, or 2 (yes or no)?

Each student in the random sample is asked to mix the deck, draw a card, and answer the resulting question truthfully. Because of the irrelevant question on type II cards, a yes response no longer stigmatizes the respondent, so we assume that responses are truthful. Let p denote the proportion of honor-code violators (i.e., the probability of a randomly selected student being a violator), and let $\lambda$ = P(yes response). Then $\lambda$ and p are related by $\lambda = .5p + (.5)(.3)$.

a. Let Y denote the number of yes responses, so $Y \sim \text{Bin}(n, \lambda)$. Thus Y/n is an unbiased estimator of $\lambda$. Derive an estimator for p based on Y. If n = 80 and y = 20, what is your estimate? [Hint: Solve $\lambda = .5p + .15$ for p and then substitute Y/n for $\lambda$.]

b. Use the fact that $E(Y/n) = \lambda$ to show that your estimator $\hat{p}$ is unbiased.

c. If there were 70 type I and 30 type II cards, what would be your estimator for p?


22. Return to the problem of estimating the population proportion p and consider another adjusted estimator, namely

$$\tilde{P} = \frac{X + \sqrt{n/4}}{n + \sqrt{n}}$$

(The justification for this estimator comes from the Bayesian approach to point estimation.)

a. Determine the mean squared error of this estimator. What is interesting about this MSE?

b. Compare the MSE of this estimator to the MSE of the usual estimator (the sample proportion).

23. Show that $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = V(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2$ (the mean squared error proposition from earlier in this section). [Hint: Write u for $E(\hat{\theta})$. Expand the two quadratic expressions, and use the variance shortcut formula to rewrite $V(\hat{\theta})$.]


24. Show that $\hat{\theta}$ is a consistent estimator of $\theta$ (in the mean-square sense) if and only if both (1) $E(\hat{\theta}) \to \theta$ and (2) $V(\hat{\theta}) \to 0$ as $n \to \infty$.


7.2 The Methods of Moments and Maximum Likelihood


The point estimators introduced in Section 7.1 were obtained via common sense and/or educated 
guesswork. We now introduce two “constructive” methods for obtaining point estimators: the method 
of moments and the method of maximum likelihood. By constructive we mean that the general 
definition of each type of estimator suggests explicitly how to obtain the estimator in any specific 
problem. Although maximum likelihood estimators are generally preferable to moment estimators 
because of certain efficiency properties, they often require significantly more computation than do 
moment estimators. (It is sometimes the case that the two methods produce the same estimator.)


The Method of Moments 
The basic idea of this method is to equate certain sample characteristics, such as the sample mean, to 
the corresponding population expected values. Then solving these equations for unknown parameter 
values yields the estimators. 


DEFINITION Let $X_1, \ldots, X_n$ be a random sample from some distribution. For k = 1, 2, 3, ...,
the kth population moment, or kth moment of the distribution, is $E(X^k)$. The
kth sample moment is $(1/n)\sum_{i=1}^{n} X_i^k$.


Thus the first population moment is $E(X) = \mu$ and the first sample moment is $\sum X_i/n = \bar{X}$. The
second population and sample moments are $E(X^2)$ and $\sum X_i^2/n$, respectively. The population
moments will be functions of any unknown parameters $\theta_1, \theta_2, \ldots$.


DEFINITION Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution depending on parameters
$\theta_1, \ldots, \theta_m$ whose values are unknown. Then the method of moments estimators
(mmes) $\hat{\theta}_1, \ldots, \hat{\theta}_m$ are obtained by equating the first m sample moments to the
corresponding first m population moments and solving for $\theta_1, \ldots, \theta_m$.


If, for example, m = 2, $E(X)$ and $E(X^2)$ will be functions of $\theta_1$ and $\theta_2$. Setting $E(X) = \sum X_i/n = \bar{X}$
and $E(X^2) = \sum X_i^2/n$ gives two equations in $\theta_1$ and $\theta_2$. The solution then defines the estimators.


Example 7.13 Let $X_1, \ldots, X_n$ represent a random sample of service times of n customers at a certain
facility, where the underlying distribution is assumed exponential with parameter $\lambda$. Since there is
only one parameter to be estimated, the estimator is obtained by equating $E(X)$ to $\bar{X}$. Since $E(X) = 1/\lambda$
for an exponential distribution, this gives $1/\lambda = \bar{X}$ or $\lambda = 1/\bar{X}$. The mme of $\lambda$ is then $\hat{\lambda} = 1/\bar{X}$.


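Example 7.14 Let $X_1, \ldots, X_n$ be a random sample from a gamma distribution with parameters $\alpha$ and
$\beta$. From Section 4.4, $E(X) = \alpha\beta$ and $E(X^2) = \beta^2\Gamma(\alpha + 2)/\Gamma(\alpha) = \beta^2(\alpha + 1)\alpha$. The mmes of $\alpha$ and $\beta$ are
obtained by solving

$$\bar{X} = \alpha\beta \qquad \frac{1}{n}\sum X_i^2 = \alpha(\alpha + 1)\beta^2$$

A little straightforward algebra gives the estimators

$$\hat{\alpha} = \frac{\bar{X}^2}{(1/n)\sum X_i^2 - \bar{X}^2} \qquad \hat{\beta} = \frac{(1/n)\sum X_i^2 - \bar{X}^2}{\bar{X}}$$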

To illustrate, the article cited in Example 4.29 recommends using the gamma distribution to model 
times between attempted connections to a server from suspicious IP addresses. The article includes 
the following n = 31 observations for such interarrival times (hours) for one particular server being 
“hit” by a specific suspicious IP address: 


2.3403 8.0347 8.4395 17.3053 2.9156 10.1836 2.1481 4.0839 2.3567 6.0122 
1.0270 0.1208 12.9981 14.9370 3.7714 1.3228 0.3270 9.9028 3.4356 4.0326 
3.0470 1.3922 0.3828 0.6180 4.0120 4.4803 8.6706 0.2933 2.9467 17.3828 
0.9431 


from which $\bar{x} = 5.157$ and $(1/31)\sum x_i^2 = 51.168$. The parameter estimates are

$$\hat{\alpha} = \frac{(5.157)^2}{51.168 - (5.157)^2} = 1.082 \qquad \hat{\beta} = \frac{51.168 - (5.157)^2}{5.157} = 4.765$$

These estimates of $\alpha$ and $\beta$ fall into the range of parameter estimates suggested by the article’s
authors. (We’ll consider interval estimates of parameters in Chapter 8.) ■
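
The gamma moment estimates can be reproduced with a short Python sketch. It simply evaluates the two mme formulas on the 31 interarrival times listed in the example; the function name `gamma_mme` is a hypothetical helper, not something from the cited article.

```python
def gamma_mme(data):
    """Method of moments estimates (alpha-hat, beta-hat) for a gamma sample."""
    n = len(data)
    m1 = sum(data) / n                  # first sample moment, x-bar
    m2 = sum(x * x for x in data) / n   # second sample moment
    spread = m2 - m1 ** 2               # second sample moment minus x-bar squared
    return m1 ** 2 / spread, spread / m1

# Interarrival times (hours) listed in Example 7.14
times = [
    2.3403, 8.0347, 8.4395, 17.3053, 2.9156, 10.1836, 2.1481, 4.0839, 2.3567, 6.0122,
    1.0270, 0.1208, 12.9981, 14.9370, 3.7714, 1.3228, 0.3270, 9.9028, 3.4356, 4.0326,
    3.0470, 1.3922, 0.3828, 0.6180, 4.0120, 4.4803, 8.6706, 0.2933, 2.9467, 17.3828,
    0.9431,
]
alpha_hat, beta_hat = gamma_mme(times)
print(f"alpha-hat = {alpha_hat:.3f}, beta-hat = {beta_hat:.3f}")
```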


Example 7.15 Let $X_1, \ldots, X_n$ be a random sample from the following discrete distribution:

$$p(x; r, p) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x \qquad x = 0, 1, 2, \ldots$$

This is a variant on the generalized negative binomial distribution with parameters r and p (see
Chapter 3, Exercise 124). It can be shown for this distribution that $E(X) = r(1 - p)/p$ and
$V(X) = r(1 - p)/p^2$, from which $E(X^2) = V(X) + [E(X)]^2 = r(1 - p)(r - rp + 1)/p^2$. Equating $E(X)$ to $\bar{X}$ and
$E(X^2)$ to $(1/n)\sum X_i^2$ eventually gives

$$\hat{p} = \frac{\bar{X}}{(1/n)\sum X_i^2 - \bar{X}^2} \qquad \hat{r} = \frac{\bar{X}^2}{(1/n)\sum X_i^2 - \bar{X}^2 - \bar{X}}$$

As an illustration, the article “Chains of Transmission and Control of Ebola Virus Disease in Conakry,
Guinea, in 2014: an Observational Study” (Lancet Infect. Dis. 2015: 320-326) describes a study
of the number of secondary Ebola cases stemming from 152 infected people (a secondary case means
they give someone else the disease). The data is as follows:

Number of cases   0     1    2   3   4   5   8   9   14   17
Frequency         109   16   9   5   5   2   1   3   1    1

A follow-up letter in the same journal investigated modeling these counts with a generalized
negative binomial distribution. First,

$$\bar{x} = \sum x_i/152 = [0(109) + 1(16) + \cdots + 17(1)]/152 = 0.954$$

and

$$\sum x_i^2/152 = [0^2(109) + 1^2(16) + \cdots + 17^2(1)]/152 = 6.704$$

Thus, the mmes for p and r in this case are

$$\hat{p} = \frac{0.954}{6.704 - (0.954)^2} = .165 \qquad \hat{r} = \frac{(0.954)^2}{6.704 - (0.954)^2 - 0.954} = .188$$

Although r by definition must be positive, the denominator of $\hat{r}$ could potentially turn out negative,
which would indicate that the generalized negative binomial distribution is not appropriate (or that the
moment estimator is flawed). ■
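
As a check on the arithmetic, the two moment equations can be solved in code. Since $E(X)/V(X) = p$ and $[E(X)]^2/[V(X) - E(X)] = r$ for this distribution, the sketch below plugs in the two reported sample moments (0.954 and 6.704); the function name `negbin_mme` is hypothetical.

```python
def negbin_mme(m1, m2):
    """mmes (p-hat, r-hat) for the generalized negative binomial,
    given the first two sample moments m1 and m2."""
    var_like = m2 - m1 ** 2                # plays the role of V(X) = r(1 - p)/p^2
    p_hat = m1 / var_like                  # because E(X)/V(X) = p
    r_hat = m1 ** 2 / (var_like - m1)      # negative whenever var_like < m1
    return p_hat, r_hat

p_hat, r_hat = negbin_mme(0.954, 6.704)
print(f"p-hat = {p_hat:.3f}, r-hat = {r_hat:.3f}")
```

The denominator `var_like - m1` makes the caveat in the example concrete: if the sample shows less spread than a Poisson distribution would (second central moment below the mean), the estimate of r comes out negative.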


Maximum Likelihood Estimation 

The method of maximum likelihood was first introduced by R. A. Fisher, a geneticist and statistician, 
in the 1920s. Most statisticians recommend this method, at least when the sample size is large, since 
the resulting estimators have certain desirable efficiency properties (see the proposition on large 
sample behavior toward the end of this section, as well as Section 7.4). The following example 
illustrates the key underlying concept. 


Example 7.16 A May 2018 article on www.howtogeek.com discusses traditional criteria for 
“strong” passwords and the emerging advice to use longer passphrases by concatenating several 
everyday words. Suppose that 10 students at a certain university are randomly selected, and it is found 
that the first, third, and tenth students use passphrases for their email accounts, whereas the other 
seven students do not. Let p = P(passphrase); i.e., p is the proportion of all students at the university 
using a passphrase on their email accounts. Define Bernoulli random variables $X_1, X_2, \ldots, X_{10}$ by

$$X_i = \begin{cases} 1 & \text{if the } i\text{th student uses a passphrase} \\ 0 & \text{if not} \end{cases}$$

For the obtained sample, $x_1 = x_3 = x_{10} = 1$ and the other seven $x_i$’s are all zero. Students’ decisions
about whether to use passphrases are presumably independent of one another, so that the $X_i$’s are
independent and the probability of observing the obtained sample is

$$p \cdot (1 - p) \cdot p \cdot (1 - p) \cdot (1 - p) \cdots p = p^3(1 - p)^7 \qquad (7.4)$$

We now ask, “For what value of p is the obtained sample most likely to have occurred?” That is, we
wish to find the value of p that maximizes the joint pmf (7.4) or, equivalently, maximizes the natural
log of (7.4).² Figure 7.4a shows a graph of the likelihood (7.4) as a function of p. It appears that the
graph reaches its peak above p = .3, which is the proportion of passphrases in the sample. Figure 7.4b
shows a graph of the natural logarithm of (7.4), whose maximum will occur at the same value.


Figure 7.4 Likelihood (panel a) and ln(likelihood) (panel b) plotted against p


Here,

$$\ln[p^3(1 - p)^7] = 3\ln(p) + 7\ln(1 - p)$$

$$\frac{d}{dp}\ln[p^3(1 - p)^7] = \frac{3}{p} - \frac{7}{1 - p} = 0 \;\Rightarrow\; 3(1 - p) = 7p$$

So p = 3/10 = .30 maximizes the (log of the) probability of the specified sample, as conjectured. For
that reason, the point estimate $\hat{p} = .30$ is called the maximum likelihood estimate of the parameter p.³
To be clear, we could also have differentiated (7.4) directly and set that derivative equal to 0 to obtain
the same result; taking the logarithm simply made the calculus easier.

Now suppose that rather than being told each individual student’s decision, we had only been
informed that three of the ten used passphrases. Then we would have the observed value of the
binomial random variable X = the number of passphrases. The pmf of X is $\binom{10}{x} p^x (1 - p)^{10 - x}$; for
x = 3, this becomes $\binom{10}{3} p^3 (1 - p)^7$. The binomial coefficient $\binom{10}{3}$ is irrelevant to the maximization,
and so the value of p that maximizes the likelihood of observing X = 3 is again $\hat{p} = .30$. ■


² Since the natural logarithm is a monotone function, finding u to maximize ln[g(u)] is equivalent to finding u to
maximize g(u). Taking the logarithm will frequently make differentiation easier.

³ In general, the second derivative should be examined to make sure a maximum has been obtained, but here this is
obvious from the figure.

DEFINITION Let $X_1, \ldots, X_n$ have a joint distribution (i.e., a joint pmf or pdf) that depends
on a parameter $\theta$ whose value is unknown. This joint distribution, regarded as a
function of $\theta$, is called the likelihood function and is denoted by $L(\theta)$. The
maximum likelihood estimate (mle) $\hat{\theta}$ is the value of $\theta$ that maximizes the
likelihood function.


Echoing the terminology from the previous section, we call $\hat{\theta}$ a maximum likelihood estimate if it’s
expressed in terms of our observed sample data and a maximum likelihood estimator if $\hat{\theta}$ is regarded
as a function of the random variables $X_1, \ldots, X_n$.

In Example 7.16, the joint pmf of $X_1, \ldots, X_{10}$ became $p^3(1 - p)^7$ once the observed values of the
$X_i$’s were substituted. So, the likelihood function would be written $L(p) = p^3(1 - p)^7$. If we take the
perspective that our data consists of a single binomial observation, then $L(p) = \binom{10}{3} p^3(1 - p)^7$. In
either case, the value of p that maximizes L(p) is $\hat{p} = .3$.

The likelihood function tells us how likely the observed sample is, as a function of the possible 
parameter value. Maximizing the likelihood gives the parameter value for which the observed sample 
is most likely to have been generated, that is, the parameter value that “agrees most closely” with the 
observed data. As in Example 7.16, maximizing the likelihood is equivalent to maximizing the 
logarithm of the likelihood; the latter is typically computationally easier, since the likelihood is 
typically a product and so its logarithm is a sum. We will use $\ell(\theta)$ to denote the natural logarithm of
the likelihood function, $\ell(\theta) = \ln[L(\theta)]$, commonly referred to as the log-likelihood function.
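
To see the maximization numerically, the following Python sketch evaluates the log-likelihood from Example 7.16, $3\ln(p) + 7\ln(1 - p)$, on a grid of candidate values and picks the largest. The grid spacing of .001 is an arbitrary choice for illustration; calculus gives the exact answer.

```python
import math

def log_likelihood(p, successes=3, failures=7):
    """Log of the likelihood p^3 (1 - p)^7 from Example 7.16."""
    return successes * math.log(p) + failures * math.log(1 - p)

# Candidate values of p strictly inside (0, 1), where both logarithms are defined
grid = [k / 1000 for k in range(1, 1000)]
p_best = max(grid, key=log_likelihood)
print(p_best)
```

Because the logarithm is monotone, maximizing `log_likelihood` over the grid selects the same p as maximizing the likelihood itself.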


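Example 7.17 Suppose $X_1, \ldots, X_n$ is a random sample from an exponential distribution with
parameter $\lambda$. Because of independence, the likelihood function is a product of the individual pdfs:

$$L(\lambda) = f(x_1; \lambda) \cdots f(x_n; \lambda) = (\lambda e^{-\lambda x_1}) \cdots (\lambda e^{-\lambda x_n}) = \lambda^n e^{-\lambda \sum x_i}$$

Next, we determine the value of $\lambda$ that maximizes the logarithm of this function:

$$\ell(\lambda) = \ln[L(\lambda)] = n\ln(\lambda) - \lambda \sum x_i$$

$$\ell'(\lambda) = \frac{n}{\lambda} - \sum x_i = 0 \;\Rightarrow\; \lambda = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$

Thus the mle is $\hat{\lambda} = 1/\bar{X}$. This is exactly the same as the mme that we found in Example 7.13; as
noted previously, the two methods often yield the same estimator. Unfortunately, $\hat{\lambda}$ is not an unbiased
estimator (see Exercise 13), since $E(1/\bar{X}) \ne 1/E(\bar{X})$. ■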
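
The bias of $\hat{\lambda} = 1/\bar{X}$ can be seen in a small simulation. The sketch below is illustrative only: the true value $\lambda = 2$, the sample size n = 5, and the number of replications are arbitrary choices, and the comparison value $n\lambda/(n - 1)$ is the mean stated in Exercise 13.

```python
import random

def simulate_mle_mean(lam, n, reps=20000, seed=7):
    """Average of lambda-hat = 1/x-bar over many simulated exponential samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample_sum = sum(rng.expovariate(lam) for _ in range(n))
        total += n / sample_sum          # lambda-hat = 1 / x-bar for this sample
    return total / reps

lam, n = 2.0, 5
avg = simulate_mle_mean(lam, n)
print(f"simulated mean of lambda-hat: {avg:.3f}")
print(f"n*lambda/(n - 1):             {n * lam / (n - 1):.3f}")
```

With a sample size this small the estimator overshoots $\lambda$ noticeably on average; the factor n/(n − 1) approaches 1 as n grows.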


Example 7.18 In Chapter 3, we indicated that the Poisson distribution could be used for modeling
the number of events of some sort that occur in a two-dimensional region (e.g., the occurrence of
tornadoes in a particular Midwest county during a given time period). Assume that when the region
R being sampled has area a(R), the number X of events occurring in R has a Poisson distribution with
mean $\lambda \cdot a(R)$, so $\lambda$ represents the expected number of events per unit area, and that nonoverlapping
regions yield independent X’s. (This is called a spatial Poisson process.)

Suppose an ecologist selects n nonoverlapping regions $R_1, \ldots, R_n$ and counts the number of plants
of a certain species found in each region. The joint pmf (likelihood) is then

$$L(\lambda) = p(x_1, \ldots, x_n; \lambda) = \frac{[\lambda \cdot a(R_1)]^{x_1} e^{-\lambda \cdot a(R_1)}}{x_1!} \cdots \frac{[\lambda \cdot a(R_n)]^{x_n} e^{-\lambda \cdot a(R_n)}}{x_n!}$$

$$= \frac{[a(R_1)]^{x_1} \cdots [a(R_n)]^{x_n} \cdot \lambda^{\sum x_i} \cdot e^{-\lambda \sum a(R_i)}}{x_1! \cdots x_n!} = C \cdot \lambda^{\sum x_i} \cdot e^{-\lambda \sum a(R_i)}$$

where the quantity C does not involve the parameter $\lambda$ (and, hence, will not impact maximization).
Then,

$$\ell(\lambda) = \ln[L(\lambda)] = \ln(C) + \ln(\lambda) \cdot \sum x_i - \lambda \sum a(R_i)$$

$$\ell'(\lambda) = 0 + \frac{\sum x_i}{\lambda} - \sum a(R_i) = 0 \;\Rightarrow\; \lambda = \frac{\sum x_i}{\sum a(R_i)}$$

The mle is $\hat{\lambda} = \sum X_i / \sum a(R_i)$. This is intuitively reasonable because $\lambda$ is the true density (plants per
unit area), whereas $\hat{\lambda}$ is the sample density: $\sum X_i$ is the number of plants counted, and $\sum a(R_i)$ is just
the total area sampled. Because $E(X_i) = \lambda \cdot a(R_i)$, the estimator is unbiased.

Sometimes an alternative sampling procedure is used. Instead of fixing regions to be sampled, the
ecologist will select n points in the entire region of interest and let $y_i$ = the distance from the ith point
to the nearest plant. The cdf of Y = distance to the nearest plant is

$$F_Y(y) = P(Y \le y) = 1 - P(Y > y) = 1 - P(\text{no plants in a circle of radius } y)$$

$$= 1 - \frac{e^{-\lambda \pi y^2}(\lambda \pi y^2)^0}{0!} = 1 - e^{-\lambda \pi y^2}$$

Taking the derivative of $F_Y(y)$ with respect to y yields

$$f_Y(y; \lambda) = 2\pi\lambda y e^{-\lambda \pi y^2} \qquad y \ge 0$$

If we now form the likelihood $L(\lambda) = f_Y(y_1; \lambda) \cdots f_Y(y_n; \lambda)$, differentiate $\ln[L(\lambda)]$, and so on, the
resulting mle is

$$\hat{\lambda} = \frac{n}{\pi \sum Y_i^2} = \frac{\text{number of plants observed}}{\text{total area sampled}}$$

which is also a sample plant density. It can be shown that in a sparse environment (small $\lambda$), the
distance method is in a certain sense better, whereas in a dense environment, the first sampling
method is better. ■
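
Both density estimates from Example 7.18 are simple ratios, as the following Python sketch shows. The counts, areas, and distances are hypothetical values made up for illustration, and the function names are not from the text.

```python
import math

def density_mle_quadrats(counts, areas):
    """mle of lambda when plants are counted in fixed regions:
    total plants counted divided by total area sampled."""
    return sum(counts) / sum(areas)

def density_mle_distances(distances):
    """mle of lambda from point-to-nearest-plant distances:
    n plants observed over n circles with total area pi * sum(y_i^2)."""
    n = len(distances)
    return n / (math.pi * sum(y * y for y in distances))

# Hypothetical field data, invented for illustration
counts = [4, 7, 2, 5]                   # plants found in each region
areas = [10.0, 15.0, 5.0, 10.0]         # region areas (square meters)
distances = [0.9, 1.4, 0.6, 1.1, 0.8]   # distances to nearest plant (meters)

lam_quadrat = density_mle_quadrats(counts, areas)
lam_distance = density_mle_distances(distances)
print(f"region-count estimate: {lam_quadrat:.3f} plants per square meter")
print(f"distance estimate:     {lam_distance:.3f} plants per square meter")
```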
The definition of mles can be extended in the natural way to distributional families that include two
or more parameters. The mles of parameters $\theta_1, \ldots, \theta_m$ are those values $\hat{\theta}_1, \ldots, \hat{\theta}_m$ that maximize the
likelihood function $L(\theta_1, \ldots, \theta_m)$ or, equivalently, the logarithm of the likelihood function.

Example 7.19 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution, which includes the
two parameters $\mu$ and $\sigma$. The likelihood function is

$$L(\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x_1 - \mu)^2/(2\sigma^2)} \cdots \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x_n - \mu)^2/(2\sigma^2)} = (2\pi\sigma^2)^{-n/2} e^{-\sum (x_i - \mu)^2/(2\sigma^2)}$$

so

$$\ell(\mu, \sigma) = \ln[L(\mu, \sigma)] = -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2$$

To find the maximizing values of $\mu$ and $\sigma$, we must take the partial derivatives of $\ell(\mu, \sigma)$ with respect
to both $\mu$ and $\sigma$, equate them to zero, and solve the resulting two equations. Omitting the details, the
resulting mles are

$$\hat{\mu} = \bar{X} \qquad \hat{\sigma} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$$

Notice that the mle of $\sigma$ is not the sample standard deviation, S, since the denominator in the mle is
n and not n − 1. ■
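
The difference between the mle of $\sigma$ and the sample standard deviation S is easy to see numerically. The sketch below uses a small hypothetical sample; `statistics.stdev` uses the divisor n − 1, whereas the mle uses n.

```python
import math
import statistics

def normal_mle(data):
    """mles of mu and sigma for a normal sample (divisor n, not n - 1)."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
    return mu_hat, sigma_hat

# Hypothetical measurements, invented for illustration
sample = [4.1, 5.3, 3.8, 6.0, 4.8]
mu_hat, sigma_hat = normal_mle(sample)
s = statistics.stdev(sample)            # sample standard deviation S (divisor n - 1)
print(f"mu-hat = {mu_hat:.3f}, sigma-hat = {sigma_hat:.4f}, S = {s:.4f}")
```

The two differ by the factor $\sqrt{(n - 1)/n}$, so the mle is always slightly smaller than S, and the gap shrinks as n grows.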


Example 7.20 Let X,, ..., X, be a random sample from a Weibull pdf 


F(x, B) =p -e G/B" x>0 


Writing the likelihood L(«, f) and log-likelihood ¢(«, 6), then setting both 0€/O« = 0 and 0¢/08 = 0 
yields the equations 


a— 


oe ete) es jp (=a\" 


These two equations cannot be solved explicitly to give general formulas for the mles & and B. 
Instead, for any sample x), ..., x,, the equations must be solved using an iterative numerical 
procedure. 

The iterative mle computations can be done using statistical software. In R, the command
fitdistr(x, "weibull") will return $\hat\alpha$ and $\hat\beta$ assuming the data are stored in the vector x (the
MASS package must be installed and loaded first). As an example, consider the following data on the survival
time (weeks) of male mice subjected to 240 rads of gamma radiation (from A. J. Gross and V. Clark,
Survival Distributions: Reliability Applications in the Biomedical Sciences):

152  115  109   94   88  137  152   77  160  165
125   40  128  123  136  101   62  153   83   69
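For readers who want to see one possible iterative procedure, the following is a minimal sketch that solves the two Weibull likelihood equations of this example by bisection on the $\alpha$ equation, using only the Python standard library. The function names, the search interval [0.1, 50], and the tolerance are assumptions made for illustration; this is not the algorithm used by fitdistr.

```python
import math

# Survival times (weeks) quoted in Example 7.20.
x = [152, 115, 109, 94, 88, 137, 152, 77, 160, 165,
     125, 40, 128, 123, 136, 101, 62, 153, 83, 69]

def alpha_equation(alpha, xs):
    """sum(x^a ln x)/sum(x^a) - mean(ln x) - 1/a, which is zero at the mle of alpha."""
    xmax = max(xs)
    # Dividing by xmax leaves the ratio unchanged and avoids huge powers.
    w = [(v / xmax) ** alpha for v in xs]
    weighted_log = sum(wi * math.log(v) for wi, v in zip(w, xs)) / sum(w)
    mean_log = sum(math.log(v) for v in xs) / len(xs)
    return weighted_log - mean_log - 1.0 / alpha

def weibull_mle(xs, lo=0.1, hi=50.0, tol=1e-10):
    """Bisection for alpha (assumes a sign change on [lo, hi]), then beta in closed form."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if alpha_equation(mid, xs) < 0:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    beta = (sum(v ** alpha for v in xs) / len(xs)) ** (1 / alpha)
    return alpha, beta

alpha_hat, beta_hat = weibull_mle(x)
print(round(alpha_hat, 3), round(beta_hat, 2))
```

Bisection works here because the left side of the rewritten $\alpha$ equation is increasing in $\alpha$; the printed values should land close to the software estimates quoted in this example.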


A Weibull probability plot supports the plausibility of assuming that survival time has a Weibull
distribution. With the aid of software, maximum likelihood estimates of the Weibull parameters are
$\hat\alpha = 3.799$ and $\hat\beta = 125.88$. Figure 7.5 shows the Weibull log likelihood as a function of both $\alpha$ and
$\beta$. The surface near the top has a rounded shape, allowing the maximum to be found easily, but for
some distributions the surface can be much more irregular, and the maximum may be hard to find.

Figure 7.5 Weibull log likelihood for Example 7.20 ■


Some Properties of MLEs
In Example 7.19, we obtained the mle of $\sigma$ when the underlying distribution is normal. The mle of $\sigma^2$,
as well as many other mles, can be easily derived using the following proposition.


MLE INVARIANCE PRINCIPLE Let $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_m$ be the mles of the parameters $\theta_1, \theta_2, \ldots, \theta_m$. Then the
mle of any function $h(\theta_1, \theta_2, \ldots, \theta_m)$ of these parameters is the
function $h(\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_m)$ of the mles.


For an intuitive idea of the proof, consider the special case m = 1, with $\theta_1 = \theta$, and assume that $h(\cdot)$ is
a one-to-one function. On the graph of the likelihood as a function of the parameter $\theta$, the highest
point occurs where $\theta = \hat\theta$. Now consider the graph of the likelihood as a function of $h(\theta)$. In the new
graph the same heights occur, but the height that was previously plotted at $\theta = a$ is now plotted at
$h(\theta) = h(a)$, and the highest point is now plotted at $h(\theta) = h(\hat\theta)$. Thus, the maximum remains the
same, but it now occurs at $h(\hat\theta)$.
Example 7.21 (Example 7.19 continued) In the case of a random sample from a normal distribution,
the mles of $\mu$ and $\sigma$ are $\hat\mu = \bar{X}$ and $\hat\sigma = \sqrt{\sum (X_i - \bar{X})^2/n}$. To obtain the mle of the function
$h(\mu, \sigma) = \sigma^2$, substitute the mles into the function:

$$\widehat{\sigma^2} = \hat\sigma^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$$

The mle of $\sigma^2$ is not the unbiased estimator (the sample variance $S^2$ is), although they are close when
n is large. Similarly, the mle of the population coefficient of variation, defined by $h(\mu, \sigma) = 100\sigma/\mu$, is
simply $100\hat\sigma/\hat\mu$. ■

Example 7.22 (Example 7.20 continued) From Section 4.5, the mean value of a Weibull rv X is

$$\mu = \beta \cdot \Gamma(1 + 1/\alpha)$$

The mle of $\mu$ is therefore $\hat\mu = \hat\beta \cdot \Gamma(1 + 1/\hat\alpha)$, where $\hat\alpha$ and $\hat\beta$ are the mles of $\alpha$ and $\beta$. In particular, the
mle of $\mu$ in this case is not the mme $\bar{X}$, although the latter is an unbiased estimator. At least for large n,
$\hat\mu$ is a better estimator than $\bar{X}$ because the mle has lower mean squared error. ■
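The invariance principle in Example 7.22 can be tried out numerically. The sketch below plugs the Weibull mles quoted in Example 7.20 into $\beta \cdot \Gamma(1 + 1/\alpha)$ and compares the result with the sample mean of the same mouse survival data. The variable names are illustrative only.

```python
import math

alpha_hat, beta_hat = 3.799, 125.88   # mles quoted in Example 7.20

# Invariance: the mle of mu = beta * Gamma(1 + 1/alpha) is the same
# function evaluated at the mles.
mu_hat = beta_hat * math.gamma(1 + 1 / alpha_hat)

# Method of moments estimate of the mean: the sample mean of the
# survival times listed in Example 7.20.
x = [152, 115, 109, 94, 88, 137, 152, 77, 160, 165,
     125, 40, 128, 123, 136, 101, 62, 153, 83, 69]
x_bar = sum(x) / len(x)

print(round(mu_hat, 2), round(x_bar, 2))
```

The two estimates are close for these data but not identical, which is the point made in the example: the mle of the mean is a function of $\hat\alpha$ and $\hat\beta$, not the sample mean itself.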


The method of maximum likelihood estimation has considerable intuitive appeal. The following 
proposition provides additional rationale for the use of mles; see Section 7.4 for more details. 


THEOREM Under very general conditions on the joint distribution of the sample, when the
sample size is large, the maximum likelihood estimator of any parameter $\theta$ (1) is
close to $\theta$ (consistency), (2) is approximately unbiased, and (3) has variance that
is nearly as small as can be achieved by any unbiased estimator. Stated another
way, the mle $\hat\theta$ is at least approximately the MVUE of $\theta$.


Because of this result and the fact that calculus-based techniques can usually be used to derive the
mles (although numerical methods, such as Newton-Raphson, are sometimes necessary), maximum
likelihood estimation is the most widely used estimation technique among statisticians. Many of the
estimators used in the rest of this book are mles.

One consequence of the preceding theorem is that when the mle and the moments estimator differ for 
a given distribution, the mle will nearly always have smaller variance. Thus, although formulas for 
mmes are often easier to determine, the extra computation required for mles is typically worth the price. 


Some Complications 
Sometimes calculus cannot be used to obtain mles. 


Example 7.23 Suppose the waiting time for a bus is uniformly distributed on $[0, \theta]$ and the results
$x_1, \ldots, x_n$ of a random sample from this distribution have been observed. Since $f(x; \theta) = 1/\theta$ for
$0 \le x \le \theta$ and 0 otherwise,

$$L(\theta) = f(x_1, \ldots, x_n; \theta) = \begin{cases} 1/\theta^n & 0 \le x_1 \le \theta, \ldots, 0 \le x_n \le \theta \\ 0 & \text{otherwise} \end{cases}$$

As long as $\theta \ge \max(x_i)$, $L(\theta) = 1/\theta^n > 0$, but for $\theta < \max(x_i)$, the likelihood drops to 0. This is
illustrated in Figure 7.6. Calculus will not work because the maximum of the likelihood occurs at a
point of discontinuity, but the figure shows that the mle is $\hat\theta = \max(x_i)$. Thus if my waiting times are
2.3, 3.7, 1.5, 0.4, and 3.2, then the mle is $\hat\theta = 3.7$. Note that this mle is biased (see Example 7.6).


Figure 7.6 The likelihood function for Example 7.23 (vertical axis: likelihood; horizontal axis: $\theta$, with $\max(x_i)$ marked) ■
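The shape of the likelihood in Example 7.23 can be checked directly by evaluating it on a few values of $\theta$. The sketch below does so for the five waiting times in the example; the function name and the grid of $\theta$ values are illustrative assumptions.

```python
def uniform_likelihood(theta, xs):
    """L(theta) = 1/theta^n when every observation lies in [0, theta], else 0."""
    if theta <= 0 or min(xs) < 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))

waits = [2.3, 3.7, 1.5, 0.4, 3.2]   # waiting times from Example 7.23
theta_hat = max(waits)              # the mle is the largest observation

grid = [3.5, 3.69, 3.7, 3.71, 4.0, 5.0]
values = [uniform_likelihood(t, waits) for t in grid]
for t, v in zip(grid, values):
    print(t, v)
```

The likelihood is zero for every $\theta$ below the largest observation, jumps to its maximum at $\theta = 3.7$, and then decreases, so no derivative condition identifies the maximizer.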




Example 7.24 A method often used to estimate the size of a wildlife population involves performing
a capture/recapture experiment. In this experiment, an initial sample of M animals is captured, each
of these animals is tagged, and the animals are then returned to the population. After allowing enough
time for the tagged individuals to mix into the population, another sample of size n is captured. With
X = the number of tagged animals in the second sample, the objective is to use the observed x to
estimate the population size N.

The parameter of interest is $\theta = N$, which can assume only integer values, so even after determining
the likelihood function (the pmf of X here), using calculus to obtain $\hat{N}$ would present difficulties.
If we think of a "success" as a previously tagged animal being recaptured, then the sampling
is without replacement from a population containing M successes and N - M failures, so that X is a
hypergeometric rv and the likelihood function is

$$L(N) = h(x; n, M, N) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$

The integer-valued nature of N notwithstanding, it would be difficult to take the derivative of L(N).
However, let's consider the ratio of L(N) to L(N - 1):

$$\frac{L(N)}{L(N-1)} = \frac{(N - M)\cdot(N - n)}{N(N - M - n + x)}$$

This ratio is larger than 1 if and only if $N < Mn/x$. The value of N for which L(N) is maximized is
therefore the largest integer less than Mn/x. If we use standard mathematical notation [r] for the
greatest integer less than or equal to r, the mle of N is $\hat{N} = [Mn/x]$. As an illustration, if M = 200 fish
are taken from a lake and tagged, subsequently n = 100 fish are recaptured, and among the 100 there
are x = 11 tagged fish, then $\hat{N} = [(200)(100)/11] = [1818.18] = 1818$.

The estimate is actually rather intuitive; x/n is the proportion of the recaptured sample that is
tagged, whereas M/N is the proportion of the entire population that is tagged. The estimate is obtained
by equating these two proportions (estimating a population proportion by a sample proportion). ■
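The closed-form answer in Example 7.24 can be cross-checked by brute force. The sketch below evaluates the hypergeometric likelihood exactly for a range of candidate population sizes and confirms that the maximizer agrees with [Mn/x] for the fish illustration. The upper search limit of 4000 is an arbitrary assumption, and exact fractions are used only to avoid any rounding concerns.

```python
from fractions import Fraction
from math import comb

def likelihood(N, M, n, x):
    """Hypergeometric pmf h(x; n, M, N), kept exact with Fraction."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

M, n, x = 200, 100, 11     # fish illustration from Example 7.24
N_hat = (M * n) // x       # greatest integer less than or equal to Mn/x

# The smallest feasible N is M + (n - x): all tagged animals plus the
# untagged ones seen in the second sample.
candidates = range(M + n - x, 4001)
N_best = max(candidates, key=lambda N: likelihood(N, M, n, x))

print(N_hat, N_best)
```

Because the likelihood ratio L(N)/L(N - 1) exceeds 1 only while N < Mn/x, the likelihood rises up to N = 1818 and falls afterward, which is what the search confirms.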


Obtaining an mle requires that the underlying distribution be specified. Suppose $X_1, X_2, \ldots, X_n$ is a
random sample from some pdf $f(x; \theta)$ that is symmetric about $\theta$, but the investigator is unsure of the
form of the f function. It is then desirable to use an estimator that is robust, that is, one that performs
well for a wide variety of underlying pdfs. One such estimator, called an M-estimator, is based on a
generalization of maximum likelihood estimation. Instead of maximizing the log-likelihood
$\sum \ln[f(x_i; \theta)]$ for a specified f, one maximizes $\sum W[f(x_i; \theta)]$, where the "objective function" W is
selected to yield an estimator with good robustness properties. The book by David Hoaglin et al. (see
the bibliography) contains a good exposition on this subject.


Exercises: Section 7.2 (25-37)

25. A random sample of n bike helmets manufactured by a company is selected. Let
    X = the number among the n that are flawed, and let p = P(flawed). Assume that
    only X is observed, rather than the sequence of S's and F's.
    a. Derive the maximum likelihood estimator of p. If n = 20 and x = 3, what is the
       estimate?
    b. Is the estimator of part (a) unbiased?
    c. If n = 20 and x = 3, what is the mle of the probability $\theta = (1 - p)^5$ that none of
       the next five helmets examined is flawed?

26. Let X have a Weibull distribution with parameters $\alpha$ and $\beta$, so

    $$E(X) = \beta \cdot \Gamma(1 + 1/\alpha) \qquad V(X) = \beta^2\{\Gamma(1 + 2/\alpha) - [\Gamma(1 + 1/\alpha)]^2\}$$

    a. Based on a random sample $X_1, \ldots, X_n$, write equations for the method of
       moments estimators of $\beta$ and $\alpha$. Show that, once the estimate of $\alpha$ has been
       obtained, the estimate of $\beta$ can be found using the gamma function and that the
       estimate of $\alpha$ is the solution to a complicated equation involving the gamma
       function.
    b. If n = 20, $\bar{x} = 28.0$, and $\sum x_i^2 = 16{,}500$, compute the estimates. [Hint:
       $[\Gamma(1.2)]^2/\Gamma(1.4) = .95$.]

27. Let X denote the proportion of allotted time that a randomly selected student spends
    working on a certain aptitude test. Suppose the pdf of X is

    $$f(x; \theta) = (\theta + 1)x^{\theta} \qquad 0 \le x \le 1$$

    for some $\theta > -1$. A random sample of ten students yields data $x_1 = .92$, $x_2 = .79$,
    $x_3 = .90$, $x_4 = .65$, $x_5 = .86$, $x_6 = .47$, $x_7 = .73$, $x_8 = .97$, $x_9 = .94$, $x_{10} = .77$.
    a. Use the method of moments to obtain an estimator of $\theta$, and then compute the
       estimate for this data.
    b. Obtain the maximum likelihood estimator of $\theta$, and then compute the estimate
       for the given data.

28. Two different computer systems are monitored for a total of n weeks. Let $X_i$ denote
    the number of breakdowns of the first system during the ith week, and suppose the
    $X_i$'s are independent and drawn from a Poisson distribution with parameter $\mu_1$.
    Similarly, let $Y_i$ denote the number of breakdowns of the second system during
    the ith week, and assume independence with each $Y_i$ Poisson with parameter $\mu_2$.
    Derive the mles of $\mu_1$, $\mu_2$, and $\mu_1 - \mu_2$. [Hint: Using independence, write the joint
    pmf (likelihood) of the $X_i$'s and $Y_i$'s together.]

29. Refer to Exercise 25. Instead of selecting n = 20 helmets to examine, suppose we
    examine helmets in succession until we have found r = 3 flawed ones. If the 20th
    helmet is the third flawed one, what is the mle of p? Is this the same as the estimate in
    Exercise 25? Why or why not? Is it the same as the estimate computed from the
    unbiased estimator of Exercise 19?

30. Six Pepperidge Farm bagels were weighed, yielding the following data (grams):

    117.6  109.5  111.6  109.2  119.1  110.8

    a. Assuming that the six bagels are a random sample and the weight is normally
       distributed, estimate the true average weight and standard deviation of the
       weight using maximum likelihood.
    b. Again assuming a normal distribution, estimate the weight below which 95% of
       all bagels will have their weights. [Hint: What is the 95th percentile in terms of $\mu$
       and $\sigma$? Now use the invariance principle.]
    c. Suppose we choose another bagel and weigh it. Let X = weight of the bagel.
       Use the given data to obtain the mle of the probability $P(X \le 113.4)$. [Hint:
       $P(X \le 113.4) = \Phi((113.4 - \mu)/\sigma)$.]

31. Suppose a measurement is made on some physical characteristic whose value is
    known, and let X denote the resulting measurement error. It is often reasonable to
    assume that E(X) = 0 and that X has a normal distribution. Thus, the pdf of any
    particular measurement error is

    $$f(x; \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/(2\theta)} \qquad -\infty < x < \infty$$

    where $\theta$ denotes the population variance. Let $X_1, \ldots, X_n$ be a random sample of such
    measurement errors.
    a. Determine the likelihood function of $\theta$.
    b. Obtain and simplify the log-likelihood function.
    c. Differentiate the log-likelihood function to determine the mle of $\theta$.
    d. The precision of a normal distribution is defined as $\tau = 1/\theta$. Find the mle of $\tau$.

32. Let $X_1, \ldots, X_n$ be a random sample from a gamma distribution with parameters $\alpha$ and $\beta$.
    a. Derive the equations whose solution yields the maximum likelihood estimators
       of $\alpha$ and $\beta$. Does it appear that they can be solved explicitly?
    b. Show that the mle of $\mu = \alpha\beta$ is $\hat\mu = \bar{X}$.

33. Let $X_1, X_2, \ldots, X_n$ represent a random sample from the Rayleigh distribution with
    density function given in Exercise 17.
    a. Determine the maximum likelihood estimator of $\theta$ and then calculate the
       estimate for the blood plasma beta concentration data given in that exercise. Is
       this estimator the same as the unbiased estimator suggested in Exercise 17?
    b. Determine the mle of the median of the blood plasma beta concentration
       distribution. [Hint: First express the median of the Rayleigh distribution in terms
       of $\theta$.]

34. Consider a random sample $X_1, X_2, \ldots, X_n$ from the shifted exponential pdf

    $$f(x; \lambda, \theta) = \lambda e^{-\lambda(x - \theta)} \qquad x \ge \theta$$

    Taking $\theta = 0$ gives the pdf of the exponential distribution considered previously
    (with positive density to the right of zero). An example of the shifted exponential
    distribution appeared in Example 4.5, in which the variable of interest was
    hazardous flood rate and $\theta$ was the lowest water flow rate considered hazardous.
    a. Obtain the maximum likelihood estimators of $\theta$ and $\lambda$.
    b. If n = 10 hazardous flood rate observations are made, resulting in the values
       13.11, 10.64, 12.55, 12.20, 15.44, 13.42, 20.39, 18.93, 27.82, and 11.30,
       calculate the maximum likelihood estimates of $\theta$ and $\lambda$.

35. Twenty identical components are put on test. The lifetime distribution of each is
    exponential with parameter $\lambda$. The experimenter then leaves the test facility
    unmonitored. On her return 24 h later, the experimenter immediately terminates the
    test after noticing that y = 15 of the 20 components are still in operation (so 5 have
    failed). Derive the mle of $\lambda$. [Hint: Let Y = the number that survive 24 h. Then
    Y ~ Bin(n, p). What is the mle of p? Now notice that $p = P(X_i \ge 24)$, where $X_i$ is
    exponentially distributed. This relates $\lambda$ to p, so the former can be estimated once
    the latter has been.]

36. The article "A Model of Pedestrians' Waiting Times for Street Crossings at
    Signalized Intersections" (Transp. Res. 2013: 17-28) suggested that under some
    circumstances the distribution of waiting time X could be modeled with the
    following pdf:

    $$f(x; \theta, \tau) = \frac{\theta}{\tau}(1 - x/\tau)^{\theta - 1} \qquad 0 < x < \tau$$

    where $\theta > 0$ and $\tau > 0$.
    a. Suppose we observe a random sample of waiting times $X_1, \ldots, X_n$, and suppose
       that the value of the parameter $\tau$ is known. Find the mle of $\theta$.
    b. Suppose instead that $\theta$ is known but $\tau$ is unknown. Determine an equation whose
       solution is the mle of $\tau$.

37. Let $X_1, \ldots, X_n$ be a random sample from the Laplace distribution (also called the
    double exponential distribution) with pdf $f(x; \theta) = \frac{1}{2} e^{-|x - \theta|}$ for $-\infty < x < \infty$.
    a. Determine the method of moments estimator for $\theta$.
    b. Determine the maximum likelihood estimator for $\theta$. [Hint: It can be shown
       that the expression $\sum |x_i - c|$ is minimized by $c = \tilde{x}$, the median of the $x_i$'s.]


An investigator who wishes to make an inference about some parameter $\theta$ will base conclusions on
the value of one or more statistics: the sample mean $\bar{X}$, the sample standard deviation S, the sample
range $Y_n - Y_1$, and so on. Intuitively, some statistics will contain more information about $\theta$ than will
others. Sufficiency, the topic of this section, will help us decide which functions of the data are most
informative for making inferences.

As a first point, we note that a statistic $T = t(X_1, \ldots, X_n)$ will not be useful for drawing conclusions
about $\theta$ unless the distribution of T depends on $\theta$. Consider, for example, a random sample of size
n = 2 from a normal distribution with mean $\mu$ and variance $\sigma^2$, and let $T = X_1 - X_2$. Then T has a
normal distribution with mean 0 and variance $2\sigma^2$, which does not depend on $\mu$. Thus this statistic
cannot be used as a basis for drawing any conclusions about $\mu$, although it certainly does carry
information about the variance $\sigma^2$.

The relevance of this observation to sufficiency is as follows. Suppose an investigator is given the
value of some statistic T, and then examines the conditional distribution of the sample $X_1, \ldots, X_n$
given the value of the statistic (for example, the conditional distribution given that $T = \bar{X} = 28.7$). If
this conditional distribution does not depend upon $\theta$, then it can be concluded that there is no
additional information about $\theta$ in the sample over and above what is provided by T. In this sense, for
purposes of making inferences about $\theta$, it is sufficient to know the value of T, which contains all
information in the data relevant to $\theta$.


Example 7.25 An investigation of major defects on new vehicles of a certain type involved selecting
an initial random sample of n = 3 vehicles and determining for each one the value of X = the number
of major defects. This resulted in observations $x_1 = 1$, $x_2 = 0$, and $x_3 = 3$. You, as a consulting
statistician, have been provided with a description of the experiment, from which it is reasonable to
assume that X has a Poisson distribution, but you have been told only that the total number of defects
T for the three sampled vehicles was 4.

Knowing that $T = \sum X_i = 4$, would there be any additional advantage in having the observed
values of the individual $X_i$'s when making an inference about the Poisson parameter $\mu$? Or, is it
instead the case that the statistic T contains all relevant information about $\mu$ in the data? To address
this issue, consider the conditional distribution of $(X_1, X_2, X_3)$ given that $\sum X_i = 4$. First of all, there
are only a few possible $(x_1, x_2, x_3)$ triples for which $x_1 + x_2 + x_3 = 4$. For example, (0, 4, 0) is a
possibility, as are (2, 2, 0) and (1, 0, 3), but not (1, 2, 3) or (5, 0, 2). That is,

$$P(X_1 = x_1, X_2 = x_2, X_3 = x_3 \mid T = 4) = 0 \quad \text{unless } x_1 + x_2 + x_3 = 4$$

Now consider the triple (2, 1, 1), which is consistent with T = 4. A moment generating function
argument shows that T has a Poisson distribution with parameter $3\mu$. From this we calculate the
conditional probability that $(X_1, X_2, X_3) = (2, 1, 1)$, given $T = \sum X_i = 4$, as follows:

$$\begin{aligned}
P(X_1 = 2, X_2 = 1, X_3 = 1 \mid T = 4) &= \frac{P(X_1 = 2, X_2 = 1, X_3 = 1 \cap T = 4)}{P(T = 4)} \\
&= \frac{P(X_1 = 2, X_2 = 1, X_3 = 1)}{P(T = 4)} \\
&= \frac{\dfrac{e^{-\mu}\mu^2}{2!} \cdot \dfrac{e^{-\mu}\mu^1}{1!} \cdot \dfrac{e^{-\mu}\mu^1}{1!}}{\dfrac{e^{-3\mu}(3\mu)^4}{4!}} = \frac{4!}{3^4 \cdot 2!} = \frac{4}{27}
\end{aligned}$$

The particular probability isn't important; what's critical is that this conditional probability does
not depend on the unknown parameter $\mu$. The same holds true for every other triple that sums to 4,
indicating that the conditional distribution of $(X_1, X_2, X_3)$ given T does not involve $\mu$. Thus once the
value of the statistic $T = \sum X_i$ has been provided, there is no additional information about $\mu$ in the
individual observations.

To put this another way, think of obtaining the data from the experiment in two stages:

1. Observe the value of $T = X_1 + X_2 + X_3$ from a Poisson distribution with parameter $3\mu$.

2. Having observed T = 4, now obtain the individual $x_i$'s from the conditional distribution

$$P(X_1 = x_1, X_2 = x_2, X_3 = x_3 \mid T = 4)$$

Since the conditional distribution in step 2 does not involve $\mu$, there is no additional information
about $\mu$ resulting from the second stage of the data generation process. This argument holds more
generally for any sample size n and values of t other than 4 (e.g., the total number of defects among
10 randomly selected vehicles might be $\sum X_i = 16$). Once the value of $\sum X_i$ is known, there is no
further information in the data about the Poisson parameter; it is "sufficient" to be told the total.
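The conditional probability worked out in Example 7.25 can be evaluated numerically for several values of $\mu$ to see that it stays fixed. The following is a minimal sketch; the function names and the particular $\mu$ values tried are illustrative assumptions.

```python
import math

def poisson_pmf(k, mu):
    """Poisson pmf p(k; mu)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def conditional_prob(triple, mu):
    """P((X1, X2, X3) = triple | T = sum(triple)) for iid Poisson(mu) counts."""
    t = sum(triple)
    joint = math.prod(poisson_pmf(k, mu) for k in triple)
    # T = X1 + X2 + X3 has a Poisson distribution with parameter 3 * mu.
    return joint / poisson_pmf(t, 3 * mu)

probs = [conditional_prob((2, 1, 1), mu) for mu in (0.5, 1.0, 2.0, 5.0)]
print(probs)
```

Every entry agrees with 4/27 up to floating-point rounding, whatever value of $\mu$ is used, which is exactly the defining property of a sufficient statistic.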


DEFINITION A statistic $T = t(X_1, \ldots, X_n)$ is said to be sufficient for making inferences about a
parameter $\theta$ if the joint distribution of $X_1, \ldots, X_n$ given that T = t does not
depend upon $\theta$, for every possible value t of the statistic T.


The notion of sufficiency formalizes the idea that a statistic T contains all relevant information about
$\theta$. Once the value of T for the given data is available, it is of no benefit to know anything else about
the sample.


The Factorization Theorem 

How can a sufficient statistic be identified? It may seem as though one would have to select a statistic,
determine the conditional distribution of the $X_i$'s given any particular value of the statistic (no easy
task; look at the last example!), and keep doing this until hitting paydirt by finding one that satisfies
the defining condition. This would be terribly time-consuming, and when the $X_i$'s are continuous
there are additional technical difficulties in obtaining the relevant conditional distribution.
Fortunately, the next result provides a relatively straightforward way of proceeding.


THE NEYMAN FACTORIZATION THEOREM Let $f(x_1, \ldots, x_n; \theta)$ denote the joint pmf or pdf of
$X_1, \ldots, X_n$. Then $T = t(X_1, \ldots, X_n)$ is a sufficient statistic for $\theta$ if and only if there exist functions
g and h such that

$$f(x_1, \ldots, x_n; \theta) = g(t(x_1, \ldots, x_n); \theta) \cdot h(x_1, \ldots, x_n)$$

That is, the joint pmf or pdf can be represented as a product of two factors, in
which one factor includes $\theta$ and involves the data only through $t(x_1, \ldots, x_n)$
while the other factor does not depend on $\theta$.


Before sketching a proof of this theorem, we consider several examples.

Example 7.26 Let's generalize the previous example by considering a random sample $X_1, \ldots, X_n$
from a Poisson distribution with parameter $\mu$, for example, the numbers of blemishes on n independently
selected iPhone cases or the numbers of errors in n batches of tax returns where each batch
consists of many returns. The joint pmf of these variables is

$$f(x_1, \ldots, x_n; \mu) = \frac{e^{-\mu}\mu^{x_1}}{x_1!} \cdots \frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\mu^{\sum x_i}}{x_1! \cdots x_n!} = \left(e^{-n\mu}\mu^{\sum x_i}\right) \cdot \left(\frac{1}{x_1! \cdots x_n!}\right)$$

The factor inside the first set of parentheses includes the parameter $\mu$ and involves the data only
through $\sum x_i$, whereas the factor inside the second set of parentheses does not depend on $\mu$. So we
have the desired factorization, and by the factorization theorem the sufficient statistic for $\mu$ is
$T = \sum X_i$, as we ascertained in Example 7.25 directly from the definition of sufficiency. ■
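The factorization in Example 7.26 can be checked numerically. In the sketch below, g and h are hypothetical helper functions standing for the two parenthesized factors, and the two samples are chosen to share the same total; the $\mu$ values are arbitrary.

```python
import math

def joint_pmf(xs, mu):
    """Joint pmf of an iid Poisson(mu) sample."""
    return math.prod(math.exp(-mu) * mu ** x / math.factorial(x) for x in xs)

def g(t, n, mu):
    """Factor that carries mu and sees the data only through t = sum(xs)."""
    return math.exp(-n * mu) * mu ** t

def h(xs):
    """Factor that is free of mu."""
    return 1 / math.prod(math.factorial(x) for x in xs)

sample_a = [1, 0, 3]   # observed counts from Example 7.25
sample_b = [2, 1, 1]   # another triple with the same total
mus = (0.5, 1.0, 2.5)

factorization_holds = all(
    math.isclose(joint_pmf(sample_a, mu),
                 g(sum(sample_a), len(sample_a), mu) * h(sample_a))
    for mu in mus
)
# Same total, so the g factors cancel and the ratio reduces to h(a)/h(b).
ratios = [joint_pmf(sample_a, mu) / joint_pmf(sample_b, mu) for mu in mus]
print(factorization_holds, ratios)
```

Because the two samples have the same sum, their likelihoods differ only by the constant h(a)/h(b), so the data beyond the total cannot help distinguish one value of $\mu$ from another.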


A sufficient statistic is not unique: any one-to-one function of a sufficient statistic is itself sufficient.
In the Poisson example, the sample mean $\bar{X} = (1/n)\sum X_i$ is a one-to-one function of $\sum X_i$ (knowing
the value of the sum of the n observations is equivalent to knowing their mean), so the sample mean is
also a sufficient statistic.


Example 7.27 Suppose that the waiting time for a bus on a weekday morning is uniformly distributed
on the interval from 0 to $\theta$, and consider a random sample $X_1, \ldots, X_n$ of waiting times (i.e.,
times on n independently selected mornings). The joint pdf of these times is

$$f(x_1, \ldots, x_n; \theta) = \begin{cases} \dfrac{1}{\theta^n} & 0 \le x_1 \le \theta, \ldots, 0 \le x_n \le \theta \\ 0 & \text{otherwise} \end{cases}$$

To obtain the desired factorization, we introduce notation for an indicator function: I(A) = 1 if the
statement A is true, and I(A) = 0 otherwise. For instance, we may write the joint pdf of the wait times
more formally as

$$f(x_1, \ldots, x_n; \theta) = \frac{1}{\theta^n}\, I(0 \le x_1 \le \theta, \ldots, 0 \le x_n \le \theta)$$

The statement A is that all $x_i$'s are between 0 and $\theta$. But the $x_i$'s will all be between 0 and $\theta$ if and only
if (1) the smallest of the $x_i$'s is at least 0 and (2) the largest is at most $\theta$. Thus, the joint pdf can be
reexpressed as

$$\begin{aligned}
f(x_1, \ldots, x_n; \theta) &= \frac{1}{\theta^n}\, I(0 \le \min(x_1, \ldots, x_n) \text{ and } \max(x_1, \ldots, x_n) \le \theta) \\
&= \left[\frac{1}{\theta^n}\, I(\max(x_1, \ldots, x_n) \le \theta)\right] \cdot I(0 \le \min(x_1, \ldots, x_n))
\end{aligned}$$

The factor inside the square brackets includes $\theta$ and involves the $x_i$'s only through the function
$t(x_1, \ldots, x_n) = \max(x_1, \ldots, x_n)$. Voila, we have our desired factorization, and the sufficient statistic
for the uniform parameter $\theta$ is $T = \max(X_1, \ldots, X_n)$. All the information about $\theta$ in this uniform
random sample is contained in the largest of the n observations; knowing the values of the other n - 1
observations provides no further information toward estimating $\theta$. ■


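Proof of the Factorization Theorem A general proof when the $X_i$'s constitute a random sample
from a continuous distribution is fraught with technical details that are beyond the level of our text. So
we content ourselves with a proof in the discrete case. For the sake of concise notation, denote $X_1, X_2,
\ldots, X_n$ by $\mathbf{X}$ and $x_1, x_2, \ldots, x_n$ by $\mathbf{x}$.

Suppose first that $T = t(\mathbf{X})$ is sufficient, so that $P(\mathbf{X} = \mathbf{x} \mid T = t)$ does not depend upon $\theta$. Focus on
a value t for which $t(\mathbf{x}) = t$ (e.g., $\mathbf{x} = (3, 0, 1)$ and $t(\mathbf{x}) = \sum x_i$, so t = 4). The event that $\mathbf{X} = \mathbf{x}$ is then
identical to the event that both $\mathbf{X} = \mathbf{x}$ and T = t because the first equality implies the second one.
Thus

$$\begin{aligned}
f(\mathbf{x}; \theta) &= P(\mathbf{X} = \mathbf{x}; \theta) = P(\mathbf{X} = \mathbf{x} \cap T = t; \theta) \\
&= P(\mathbf{X} = \mathbf{x} \mid T = t; \theta) \cdot P(T = t; \theta) = P(\mathbf{X} = \mathbf{x} \mid T = t) \cdot P(T = t; \theta)
\end{aligned}$$

Since the first factor in the last product does not involve $\theta$ and the other involves the data only
through t, we have our desired factorization.

Now let's go the other way: assume a factorization, and show that T is sufficient, i.e., that the
conditional probability that $\mathbf{X} = \mathbf{x}$ given that T = t does not involve $\theta$.

$$\begin{aligned}
P(\mathbf{X} = \mathbf{x} \mid T = t; \theta) &= \frac{P(\mathbf{X} = \mathbf{x} \cap T = t; \theta)}{P(T = t; \theta)} = \frac{P(\mathbf{X} = \mathbf{x}; \theta)}{P(T = t; \theta)} = \frac{g(t; \theta)h(\mathbf{x})}{\sum_{\mathbf{u}: t(\mathbf{u}) = t} P(\mathbf{X} = \mathbf{u}; \theta)} \\
&= \frac{g(t; \theta)h(\mathbf{x})}{\sum_{\mathbf{u}: t(\mathbf{u}) = t} g(t(\mathbf{u}); \theta)h(\mathbf{u})} = \frac{g(t; \theta)h(\mathbf{x})}{g(t; \theta)\sum_{\mathbf{u}: t(\mathbf{u}) = t} h(\mathbf{u})} = \frac{h(\mathbf{x})}{\sum_{\mathbf{u}: t(\mathbf{u}) = t} h(\mathbf{u})}
\end{aligned}$$

Sure enough, this final ratio does not involve $\theta$. ■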


Jointly Sufficient Statistics

When the joint pmf or pdf of the data involves a single unknown parameter $\theta$, there is frequently a
single statistic (single function of the data) that is sufficient. However, when there are several
unknown parameters (for example, the mean $\mu$ and standard deviation $\sigma$ of a normal distribution, or
the shape parameter $\alpha$ and scale parameter $\beta$ of a gamma distribution) we must expand our notion of
sufficiency.


DEFINITION Suppose the joint distribution of $X_1, \ldots, X_n$ involves m unknown parameters $\theta_1,
\theta_2, \ldots, \theta_m$. The k statistics $T_1 = t_1(X_1, \ldots, X_n), \ldots, T_k = t_k(X_1, \ldots, X_n)$ are said to
be jointly sufficient for the parameters if the conditional distribution of the $X_i$'s
given that $T_1 = t_1, \ldots, T_k = t_k$ does not depend on any of the unknown parameters,
and this is true for all possible values $t_1, t_2, \ldots, t_k$ of the statistics.


Example 7.28 Consider a random sample $X_1, X_2, X_3$ of size n = 3 from any continuous distribution,
and let $T_1 < T_2 < T_3$ be their ordered values (these were denoted $Y_1 < Y_2 < Y_3$ in Section 5.7). Then
given, for example, that the three ordered values are 21.4 < 23.8 < 26.0, the original $X_i$'s are equally
likely to be any one of the 3! = 6 permutations of these numbers: (23.8, 21.4, 26.0), (26.0, 23.8, 21.4),
and so on. More formally, for any values $t_1$, $t_2$, and $t_3$ satisfying $t_1 < t_2 < t_3$,

$$P(X_1 = x_1, X_2 = x_2, X_3 = x_3 \mid T_1 = t_1, T_2 = t_2, T_3 = t_3) = \begin{cases} 1/3! & (x_1, x_2, x_3) = (t_1, t_2, t_3), \ldots, (t_3, t_2, t_1) \\ 0 & \text{otherwise} \end{cases}$$

This conditional distribution clearly does not involve any unknown parameters. Generalizing this
argument to a sample of size n, we see that for a random sample from a continuous distribution, the
n ordered values are jointly sufficient for $\theta_1, \theta_2, \ldots, \theta_m$, regardless of whether m = 1 (e.g., the
exponential distribution has a single parameter) or 2 (the normal distribution) or even m > 2. ■


The factorization theorem extends to the case of jointly sufficient statistics: $T_1, \ldots, T_k$ are jointly
sufficient for $\theta_1, \ldots, \theta_m$ if and only if the joint pmf or pdf of the $X_i$'s can be represented as a product of
two factors, where the first includes the $\theta_i$'s and involves the data only through $t_1, \ldots, t_k$ and the
second does not involve the $\theta_i$'s.


Example 7.29 Let $X_1, \ldots, X_n$ be a random sample from a $N(\mu, \sigma)$ distribution. The joint pdf is

$$f(x_1, \ldots, x_n; \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i - \mu)^2/(2\sigma^2)} = \left[\frac{1}{\sigma^n}\, e^{-(\sum x_i^2 - 2\mu\sum x_i + n\mu^2)/(2\sigma^2)}\right] \cdot \left(\frac{1}{2\pi}\right)^{n/2}$$

This factorization shows that the two statistics $\sum X_i$ and $\sum X_i^2$ are jointly sufficient for the two
parameters $\mu$ and $\sigma$. Since $\sum (X_i - \bar{X})^2 = \sum X_i^2 - n(\bar{X})^2$, there is a one-to-one correspondence
between the two sufficient statistics and the statistics $\bar{X}$ and $\sum (X_i - \bar{X})^2$; that is, values of the two
original sufficient statistics uniquely determine values of the latter two statistics, and vice versa. This
implies that the latter two statistics are also jointly sufficient, which in turn implies that the sample
mean and sample standard deviation are jointly sufficient statistics. The sample mean and sample
standard deviation (or sample variance) encapsulate all the information about $\mu$ and $\sigma$ that is
contained in the sample data. ■
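The one-to-one correspondence described in Example 7.29 can be made concrete: given only n, $\sum x_i$, and $\sum x_i^2$, the sample mean and standard deviation can be recovered, and vice versa. The sketch below uses the bagel weights from Exercise 30 purely as convenient numbers; the function names are hypothetical.

```python
import math
import statistics

def summaries_from_sums(n, sum_x, sum_x2):
    """Recover (x_bar, s) from the jointly sufficient pair (sum x_i, sum x_i^2)."""
    x_bar = sum_x / n
    ss = sum_x2 - n * x_bar ** 2      # equals sum of (x_i - x_bar)^2
    return x_bar, math.sqrt(ss / (n - 1))

def sums_from_summaries(n, x_bar, s):
    """Go back the other way: (x_bar, s) determines the two sums."""
    return n * x_bar, (n - 1) * s ** 2 + n * x_bar ** 2

data = [117.6, 109.5, 111.6, 109.2, 119.1, 110.8]   # bagel weights, Exercise 30
n = len(data)
sum_x = sum(data)
sum_x2 = sum(v * v for v in data)

x_bar, s = summaries_from_sums(n, sum_x, sum_x2)
back_sum_x, back_sum_x2 = sums_from_summaries(n, x_bar, s)
print(x_bar, s)
```

Either pair of numbers can stand in for the other, so reporting $(\bar{x}, s)$ loses nothing relative to reporting the two sums when the normal model holds.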


Minimal Sufficiency

When $X_1, \ldots, X_n$ constitute a random sample from a normal distribution, the n ordered values are
jointly sufficient for $\mu$ and $\sigma$ (see Example 7.28), and the sample mean and sample sd are also jointly
sufficient (as shown in Example 7.29). Both the ordered values and the pair $(\bar{X}, S)$ reduce the data
without any information loss, but the sample mean and sd represent a greater reduction. In
general, we would like the greatest possible reduction without information loss. A minimal (possibly
jointly) sufficient statistic is a function of every other sufficient statistic. That is, given the value(s) of
any other sufficient statistic(s), the value(s) of the minimal sufficient statistic(s) can be calculated.
A minimal sufficient statistic is the sufficient statistic having the smallest dimensionality, and thus
represents the greatest possible reduction of the data without any information loss.

A general discussion of minimal sufficiency is beyond the scope of our text. In the case of a normal
distribution with values of both $\mu$ and $\sigma$ unknown, it can be shown that the sample mean and sample
sd are jointly minimal sufficient (so the same is true of $\sum X_i$ and $\sum X_i^2$). It is intuitively reasonable
that because there are two unknown parameters, there should be a pair of sufficient statistics. It is
indeed often the case that the number of minimal sufficient statistics matches the number of
unknown parameters. But this is not always true. Consider a random sample $X_1, \ldots, X_n$ from the pdf
$f(x; \theta) = 1/\{\pi[1 + (x - \theta)^2]\}$, i.e., from a Cauchy distribution with location parameter $\theta$. Because the
Cauchy distribution is continuous, the n ordered values are jointly sufficient for $\theta$. It would seem,
though, that a single sufficient statistic (one-dimensional) could be found for the single parameter
$\theta$. Unfortunately this is not the case: it can be shown that the ordered values are minimal sufficient!
So going beyond the ordered values to any single function of the $X_i$'s as a point estimator of $\theta$ entails
a loss of information from the original data.

Improving an Estimator

Because a sufficient statistic contains all the information the data has to offer about the value of $\theta$, it is
reasonable that an estimator of $\theta$, or any function of $\theta$, should depend on the data only through the
sufficient statistic. A general result due to C. R. Rao and David Blackwell shows how to start with an
unbiased estimator that is not a function of sufficient statistics and create an improved estimator that is
unbiased and depends on the data only through the sufficient statistic.


RAO-BLACKWELL THEOREM Suppose that the joint distribution of $X_1, \ldots, X_n$ depends on some unknown
parameter $\theta$ and that T is sufficient for $\theta$. Consider estimating $h(\theta)$, a
specified function of $\theta$. If U is any unbiased estimator for estimating $h(\theta)$,
then the estimator $U^* = E(U \mid T)$ is also unbiased for $h(\theta)$ and has variance
no greater than the original unbiased estimator U.


Proof First of all, we must show that $U^*$ is indeed an estimator, i.e., that it is a function of the $X_i$'s
and not of $\theta$. This follows because, given that T is sufficient, the distribution of U conditional on
T does not involve $\theta$, so the expected value $E(U \mid T)$ will of course not involve $\theta$. Second, the fact that
U and $U^*$ have the same expected value (i.e., they are both unbiased estimators of $h(\theta)$) follows from
the Law of Total Expectation introduced in Section 5.4:

$$E(U^*) = E[E(U \mid T)] = E(U) = h(\theta)$$

Finally, the fact that $U^*$ has variance no larger than that of U is a consequence of the Law of Total Variance:

$$V(U) = V[E(U \mid T)] + E[V(U \mid T)] = V(U^*) + E[V(U \mid T)]$$

Because $V(U \mid T)$, being a variance, is nonnegative, it follows that $V(U) \ge V(U^*)$ as desired. ■


Example 7.30 Suppose again that the number of major defects on a randomly selected new vehicle
of a certain type has a Poisson distribution with parameter $\mu$. Now consider estimating $e^{-\mu}$, the
probability that a vehicle has no such defects, based on a random sample of n vehicles. Let's start with
the very simple estimator

$$U = \begin{cases} 1 & \text{if } X_1 = 0 \\ 0 & \text{if } X_1 > 0 \end{cases}$$

Using indicator function notation, this could be abbreviated $U = I(X_1 = 0)$. Then

$$E(U) = 1 \cdot P(X_1 = 0) + 0 \cdot P(X_1 > 0) = P(X_1 = 0) = \frac{e^{-\mu}\mu^0}{0!} = e^{-\mu}$$


Our estimator is therefore unbiased for estimating the probability of no defects. But the sufficient
statistic here is $T = \sum X_i$, and of course the estimator U is not a function of T. The improved
estimator is $U^* = E(U \mid T) = P(X_1 = 0 \mid \sum X_i)$. The event that $X_1 = 0$ and T = t is identical to the
event that the first vehicle has no defects and the total number of defects on the last n - 1 vehicles is
t. Also, an mgf argument shows that T has a Poisson($n\mu$) distribution and the sum of the last n - 1
$X_i$'s has a Poisson distribution with parameter $(n - 1)\mu$. Thus

$$P(X_1 = 0 \mid T = t) = \frac{P(X_1 = 0 \cap T = t)}{P(T = t)} = \frac{P\left(X_1 = 0 \cap \sum_{i=2}^{n} X_i = t\right)}{P(T = t)}$$

$$= \frac{P(X_1 = 0)\,P\left(\sum_{i=2}^{n} X_i = t\right)}{P(T = t)} = \frac{e^{-\mu} \cdot e^{-(n-1)\mu}[(n-1)\mu]^t / t!}{e^{-n\mu}(n\mu)^t / t!} = \left(\frac{n-1}{n}\right)^t = \left(1 - \frac{1}{n}\right)^t$$

That is, the improved unbiased estimator is $U^* = (1 - 1/n)^T$. Though the variance of $U^*$ is difficult to
derive, the Rao-Blackwell Theorem guarantees that its variance is no larger than that of U.

If, for example, there are a total of t = 15 defects among n = 10 randomly selected vehicles, then
the estimate is $u^* = (1 - 1/10)^{15} = .206$. For this same sample, $\hat{\mu} = \bar{x} = 1.5$, so the maximum
likelihood estimate of $e^{-\mu}$ is $e^{-1.5} = .223$. Here, as in some other situations, the principles of unbiased
estimation and maximum likelihood are in conflict. However, if n is large, the improved estimate is
$(1 - 1/n)^t = [(1 - 1/n)^n]^{\bar{x}} \approx e^{-\bar{x}}$, which is the mle of $e^{-\mu}$. That is, the unbiased and maximum
likelihood estimators are "asymptotically equivalent." ■
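
The behavior described in Example 7.30 can be checked by simulation. The following is a minimal sketch, not part of the original example: the function names, the seed, the number of replications, and the choice $\mu = 1.5$, n = 10 are all illustrative assumptions. It draws many Poisson samples and compares the crude estimator U, the Rao-Blackwell estimator $U^* = (1 - 1/n)^T$, and the mle $e^{-\bar{X}}$.

```python
import math
import random
import statistics


def poisson_draw(mu, rng):
    """Draw one Poisson(mu) value by multiplying uniforms (adequate for small mu)."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_poisson_estimators(mu=1.5, n=10, reps=20000, seed=1):
    """Simulate U = I(X_1 = 0), U* = (1 - 1/n)^T and the mle exp(-T/n).

    All default values are illustrative assumptions, not taken from the text.
    """
    rng = random.Random(seed)
    u_vals, u_star_vals, mle_vals = [], [], []
    for _ in range(reps):
        xs = [poisson_draw(mu, rng) for _ in range(n)]
        t = sum(xs)  # the sufficient statistic T
        u_vals.append(1.0 if xs[0] == 0 else 0.0)
        u_star_vals.append((1 - 1 / n) ** t)
        mle_vals.append(math.exp(-t / n))
    return {
        "target": math.exp(-mu),
        "mean_u": statistics.fmean(u_vals),
        "var_u": statistics.pvariance(u_vals),
        "mean_u_star": statistics.fmean(u_star_vals),
        "var_u_star": statistics.pvariance(u_star_vals),
        "mean_mle": statistics.fmean(mle_vals),
    }


if __name__ == "__main__":
    for name, value in simulate_poisson_estimators().items():
        print(f"{name}: {value:.4f}")
```

In a run of this sketch, the averages of U and $U^*$ should both sit close to $e^{-\mu}$, while the simulated variance of $U^*$ should be much smaller than that of U, as the Rao-Blackwell Theorem guarantees. The average of the mle should land slightly above $e^{-\mu}$, reflecting its bias for finite n.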


Further Comments 

The Rao-Blackwell Theorem also helps us limit the scope of possible estimators to consider for a
given distribution. If the statistic U is purely a function of the sufficient statistic T (and doesn't
otherwise rely on the $X_i$'s), then U and $U^*$ are the same; in a sense, there was nothing to improve. If
U is not purely a function of T, then the term $E[V(U \mid T)]$ in the proof will be strictly positive, and so
$U^*$ has strictly smaller variance than U. Said another way, for any statistic not based solely on a
sufficient statistic, there exists some other estimator that is superior.

For example, in Section 7.1 we looked at several potential estimators for the parameter $\theta$ of a
Uniform[0, $\theta$] distribution, including $\max(X_1,\dots,X_n)$ and $2\bar{X}$. In fact, one could concoct an endless
set of candidates; when asked, students often propose estimators of the form $\bar{X} + cS$ for some
judicious choice c > 0. But we saw in Example 7.27 that $\max(X_1,\dots,X_n)$ is sufficient for $\theta$, so any
statistic that is not purely a function of the sample maximum is necessarily inferior to some other
estimator. Since neither $\bar{X}$ nor S can be completely determined by the sample maximum, any
estimator relying on one or both of these should be rejected out of hand.
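
A short simulation makes the point concrete. This is only a sketch under stated assumptions: it compares $2\bar{X}$ with the bias-corrected maximum $[(n + 1)/n]\max(X_1,\dots,X_n)$, an unbiased function of the sufficient statistic that is assumed here rather than derived in this passage; the function name, seed, and parameter values are illustrative.

```python
import random
import statistics


def compare_uniform_estimators(theta=10.0, n=20, reps=20000, seed=7):
    """Simulate two unbiased estimators of theta for Uniform[0, theta] data.

    The bias-corrected maximum and all default values are assumptions
    chosen for illustration.
    """
    rng = random.Random(seed)
    mean_based, max_based = [], []
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        mean_based.append(2 * sum(xs) / n)         # 2 * Xbar, not a function of the maximum
        max_based.append((n + 1) / n * max(xs))    # function of the sufficient statistic only
    return {
        "mean_2xbar": statistics.fmean(mean_based),
        "var_2xbar": statistics.pvariance(mean_based),
        "mean_max": statistics.fmean(max_based),
        "var_max": statistics.pvariance(max_based),
    }


if __name__ == "__main__":
    for name, value in compare_uniform_estimators().items():
        print(f"{name}: {value:.4f}")
```

Both simulated means should be near $\theta$, while the variance of the maximum-based estimator should be several times smaller than that of $2\bar{X}$.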

We have emphasized that in general there will not be a unique sufficient statistic. Suppose there are
two different sufficient statistics $T_1$ and $T_2$ such that the first one is not a one-to-one function of the
second (e.g., we are not considering $T_1 = \sum X_i$ and $T_2 = \bar{X}$). Then it would be distressing if we
started with an unbiased estimator U and found that $E(U \mid T_1) \ne E(U \mid T_2)$, so our improved estimator
depended on which sufficient statistic we used. Fortunately there are general conditions under which,
starting with a minimal sufficient statistic T, the improved estimator is the unique MVUE (minimum
variance unbiased estimator).

Maximum likelihood is by far the most popular method for obtaining point estimates, so it would
be disappointing if maximum likelihood estimators did not make full use of sample information.
Fortunately the mles do not suffer from this defect. If $T_1,\dots,T_r$ are jointly sufficient statistics for
parameters $\theta_1,\dots,\theta_m$, then the joint pmf or pdf factors as

$$f(x_1,\dots,x_n;\theta_1,\dots,\theta_m) = g(t_1,\dots,t_r;\theta_1,\dots,\theta_m) \cdot h(x_1,\dots,x_n)$$

and the mles result from maximizing $f(\cdot)$ with respect to the $\theta_i$'s. Because the $h(\cdot)$ factor does not
involve the parameters, this is equivalent to maximizing the $g(\cdot)$ factor with respect to the $\theta_i$'s. The
resulting $\hat{\theta}_i$'s will involve the data only through the $t_j$'s. Thus it is always possible to find a maximum
likelihood estimator that is a function of just the sufficient statistic(s). There are contrived examples of
situations where the mle is not unique, in which case an mle that is not a function of the sufficient
statistics can be constructed, but there is also one that is a function of the sufficient statistics.

The concept of sufficiency is very compelling when an investigator is sure the underlying distribution
that generated the data is a member of some particular family (normal, exponential, etc.).
However, two different families of distributions might each furnish plausible models for the data in a
particular application, and yet the sufficient statistics for these two families might be different (an
analogous comment applies to maximum likelihood estimation). For example, there are data sets for
which a gamma probability plot suggests that a member of the gamma family would give a reasonable
model and also a lognormal probability plot (normal probability plot of the logs of the observations)
indicates that lognormality is plausible. Yet the jointly sufficient statistics for the parameters of the
gamma family are not the same as those for the parameters of the lognormal family. When estimating
some parameter $\theta$ in such situations (e.g., the mean $\mu$ or median $\tilde{\mu}$), one would look for a robust
estimator that performs well for a wide variety of underlying distributions, as discussed in
Section 7.1.


Exercises: Section 7.3 (38-50) 


38. The long run proportion of vehicles that pass a certain emissions test is p. Suppose
that three vehicles are independently selected for testing. Let $X_i = 1$ if the ith
vehicle passes the test and $X_i = 0$ otherwise (i = 1, 2, 3), and let $T = X_1 + X_2 + X_3$.
Use the definition of sufficiency to show that T is sufficient for p by obtaining the
conditional distribution of the $X_i$'s given that T = t for each possible value t. Then
generalize by giving an analogous argument for the case of n vehicles.

39. Components of a certain type are shipped in batches of size k. Suppose that whether or
not any particular component is satisfactory is independent of the condition of any other
component, and that the long run proportion of satisfactory components is p. Consider
n batches, and let $X_i$ denote the number of satisfactory components in the
ith batch (i = 1, 2, ..., n). Statistician A is provided with the values of all the $X_i$'s,
whereas statistician B is given only the value of $T = \sum X_i$. Use a conditional
probability argument to decide whether statistician A has more information about
p than does statistician B.

40. Let $X_1,\dots,X_n$ be a random sample of component lifetimes from an exponential
distribution with parameter $\lambda$. Use the factorization theorem to show that $\sum X_i$ is a
sufficient statistic for $\lambda$.

41. Identify a pair of jointly sufficient statistics for the two parameters of a gamma
distribution based on a random sample of size n from that distribution.

42. Identify a pair of jointly sufficient statistics for the two parameters of a beta
distribution based on a random sample of size n from that distribution.

43. Messages are sent repeatedly across a noisy communication system until r arrive
successfully. Let X = the number of transmissions required, so that X has a negative
binomial distribution with parameters r (known) and p (unknown). Determine a
sufficient statistic for p based on a random sample $X_1,\dots,X_n$ from this negative
binomial distribution.

44. Suppose waiting time for delivery of an item is uniform on the interval from $\theta_1$ to
$\theta_2$. Consider a random sample of n waiting times, and use the factorization theorem to
show that the sample minimum and maximum are a pair of jointly sufficient statistics
for $\theta_1$ and $\theta_2$. [Hint: Introduce an appropriate indicator function as we did in
Example 7.27.]

45. For $\theta > 0$ consider a random sample from a uniform distribution on the interval from
$\theta$ to $2\theta$, and use the factorization theorem to determine a sufficient statistic for $\theta$.

46. Suppose that survival time X has a lognormal distribution with parameters $\mu$ and $\sigma$
(which are the mean and standard deviation of ln(X), not of X itself). Are $\sum X_i$ and
$\sum X_i^2$ jointly sufficient for the two parameters? If not, what is a pair of jointly
sufficient statistics?

47. The probability that any particular component of a certain type works in a satisfactory
manner is p. If n of these components are independently selected, then the statistic X,
the number among the selected components that perform in a satisfactory manner, is
sufficient for p. You must purchase two of these n components for a particular system.
Obtain an unbiased statistic for the probability that exactly one of your purchased
components will perform in a satisfactory manner. [Hint: Start with the statistic U, the
indicator function of the event that exactly one of the first two components in the
sample of size n performs as desired, and improve on it by conditioning on the
sufficient statistic.]

48. In Example 7.30, we started with $U = I(X_1 = 0)$ and used a conditional expectation
argument to obtain an unbiased estimator of the zero-defect probability based on the
sufficient statistic. Consider now starting with a different statistic:
$U = \sum I(X_i = 0)/n$. Show that the improved estimator based on the sufficient statistic
is identical to the one obtained in the cited example. [Hint: Use the general property
$E(Y + Z \mid T) = E(Y \mid T) + E(Z \mid T)$.]

49. In this section, it was established that $\sum X_i$ and $\bar{X}$ are both sufficient statistics for
estimating the parameter $\mu$ of a Poisson distribution. We know that $E(\bar{X}) = \mu$, but it is
also true that $E(S^2) = \sigma^2 = \mu$ for Poisson data. So, another unbiased estimator for $\mu$ is
$\hat{\mu} = (\bar{X} + S^2)/2$. Which of these three estimators ($\bar{X}$, $S^2$, or $\hat{\mu}$) is the best choice
for estimating $\mu$? Why?

50. A particular quality characteristic of items produced using a certain process is known to
be normally distributed with mean $\mu$ and standard deviation 1. Let X denote the value
of the characteristic for a randomly selected item. An unbiased estimator for the
parameter $\theta = P(X \le c)$, where c is a critical threshold, is desired. The estimator will be
based on a random sample $X_1,\dots,X_n$.

a. Obtain a sufficient statistic for $\mu$.

b. Consider the estimator $\hat{\theta} = I(X_1 \le c)$. Obtain an improved unbiased estimator
based on the sufficient statistic (it is actually the minimum variance unbiased
estimator). [Hint: You may use the following facts: (1) The joint distribution
of $X_1$ and $\bar{X}$ is bivariate normal with means $\mu$ and $\mu$, variances 1 and 1/n,
respectively, and correlation $\rho$ (which you should determine). (2) If $Y_1$ and $Y_2$
have a bivariate normal distribution, then the conditional distribution of $Y_1$
given that $Y_2 = y_2$ is normal with mean $\mu_1 + (\rho\sigma_1/\sigma_2)(y_2 - \mu_2)$ and variance
$\sigma_1^2(1 - \rho^2)$.]


7.4


In this section we introduce the idea of Fisher information and two of its applications. The first
application is to find the minimum possible variance for an unbiased estimator. The second
application is to show that the maximum likelihood estimator is asymptotically unbiased and normal (that
is, for large n it has expected value approximately $\theta$ and it has approximately a normal distribution)
with the minimum possible variance.
To motivate Fisher information, consider a rv Y ~ Bin(n, p) with p unknown, and imagine
determining the mle of p from the log-likelihood function

$$\ell(p) = \ln\left[\binom{n}{y}p^y(1 - p)^{n-y}\right] = \ln\binom{n}{y} + y\ln(p) + (n - y)\ln(1 - p)$$

Figure 7.7 presents a graph of $\ell(p)$ for two cases: (n = 25, y = 19) and (n = 100, y = 76). By
definition, the mle maximizes $\ell(p)$; both functions graphed in Figure 7.7 achieve a maximum at .76,
because the mle for the binomial model is $\hat{p} = y/n$ and here 19/25 = .76 = 76/100. But the curves are
not identical; in particular, the graph for n = 100 is much more sharply curved than the one for n = 25. From
calculus, this means that the second derivative of $\ell(p)$ has greater magnitude when n = 100 than
when n = 25. Notice that in the vicinity of the local maximum, $\ell(p)$ is concave down and so its
second derivative is negative; the preceding observation can thus be restated as $-\ell''(p)$ is larger when
n is larger.
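
The curvature comparison can be made numerically. The sketch below is an illustration rather than part of the original text; it differentiates the log-likelihood twice by hand, giving $\ell''(p) = -y/p^2 - (n - y)/(1 - p)^2$, and evaluates the result at the common mle .76 for the two cases in Figure 7.7. The function name is hypothetical.

```python
def binomial_loglik_second_derivative(p, n, y):
    """Second derivative of the binomial log-likelihood with respect to p.

    Obtained by differentiating y*ln(p) + (n - y)*ln(1 - p) twice;
    the binomial coefficient term does not involve p.
    """
    return -y / p**2 - (n - y) / (1 - p) ** 2


if __name__ == "__main__":
    p_hat = 0.76
    small = binomial_loglik_second_derivative(p_hat, n=25, y=19)
    large = binomial_loglik_second_derivative(p_hat, n=100, y=76)
    print(f"n = 25:  l''(p_hat) = {small:.2f}")
    print(f"n = 100: l''(p_hat) = {large:.2f}")
    print(f"ratio of curvatures: {large / small:.2f}")
```

Both values are negative, and the curvature for n = 100 is four times that for n = 25, matching the fourfold increase in sample size.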


[Figure: log-likelihood curves plotted against p for (n = 25, y = 19) and (n = 100, y = 76)]

Figure 7.7 Binomial log-likelihood functions for n = 25 and n = 100


What does any of this have to do with “information”? Intuitively, a sample of size n = 100 
contains more information than does a sample of size n = 25. Statistician R. A. Fisher was one of the 
first to notice the connection between sample size and the concavity of the log-likelihood function, 
leading to the following definition. 
DEFINITION Let $f(x;\theta)$ denote a pmf or pdf. The Fisher information $I(\theta)$ in a single observation
X from $f(x;\theta)$ is defined by

$$I(\theta) = E\left[-\frac{\partial^2}{\partial\theta^2}\ln f(X;\theta)\right] \tag{7.5}$$

Partial derivative notation is used in (7.5) to emphasize that the log-likelihood function depends on
both X and $\theta$. Since X is a random variable in the definition $\ell(\theta) = \ln f(X;\theta)$, $\ell(\theta)$ and its derivatives
with respect to $\theta$ are also random variables. Thus (7.5) can be reexpressed as $I(\theta) = E[-\ell''(\theta)]$.


Example 7.31 Let X be a Bernoulli rv, so $f(x;p) = p^x(1 - p)^{1-x}$, x = 0, 1. Then the second derivative
of the log-likelihood function is

$$\ell''(p) = \frac{\partial^2}{\partial p^2}\ln[p^x(1 - p)^{1-x}] = \frac{\partial^2}{\partial p^2}[x\ln(p)] + \frac{\partial^2}{\partial p^2}[(1 - x)\ln(1 - p)] = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2}$$

To calculate Fisher information, multiply by -1, replace x with X, and calculate the expected value of
the resulting expression:

$$I(p) = E\left[\frac{X}{p^2} + \frac{1 - X}{(1 - p)^2}\right] = \frac{E(X)}{p^2} + \frac{1 - E(X)}{(1 - p)^2} = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p(1 - p)}$$

The denominator of this expression is maximized when p = .5, so the Fisher information in a single
Bernoulli trial is smallest when p = .5 and increases as p approaches 0 or 1. ■


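Example 7.32 Suppose X has the pdf $f(x;\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$. Then

$$\ell''(\theta) = \frac{\partial^2}{\partial\theta^2}\ln[\theta x^{\theta-1}] = \frac{\partial^2}{\partial\theta^2}[\ln(\theta)] + \frac{\partial^2}{\partial\theta^2}[(\theta - 1)\ln(x)] = -\frac{1}{\theta^2} + 0 = -\frac{1}{\theta^2}$$

Since x does not appear in the second derivative, Fisher information here is easily determined to be
$I(\theta) = E[-(-1/\theta^2)] = 1/\theta^2$. ■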


Although Expression (7.5) is often the computationally simplest method for determining Fisher
information, it is useful to have an alternative expression for $I(\theta)$.


PROPOSITION

$$I(\theta) = V\left[\frac{\partial}{\partial\theta}\ln f(X;\theta)\right] \tag{7.6}$$

provided that the order of the partial derivative and expectation operations in the
definition of Fisher information can be interchanged. Critically, for this interchange
to be valid, the support of the distribution (i.e., the set of possible
x values) cannot depend on $\theta$.

The quantity $\frac{\partial}{\partial\theta}\ln f(X;\theta)$ that appears in (7.6) is referred to as the score function and will shortly play
an important role. The score function is simply $\ell'(\theta)$, the first derivative of the log-likelihood
function, treated as a random variable. Under that perspective, the foregoing proposition can be
restated as $I(\theta) = V(\ell'(\theta))$.


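Proof The proof presented here assumes a discrete distribution; for the continuous case, replace the
summations below with integrals. We first establish that, under the assumptions of the proposition,
the score function has expected value equal to zero; this fact will prove useful in its own right. By the
law of the unconscious statistician,

$$E[\ell'(\theta)] = E\left[\frac{\partial}{\partial\theta}\ln f(X;\theta)\right] = \sum_x \left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right] \cdot f(x;\theta) = \sum_x \frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)} \cdot f(x;\theta)$$

$$= \sum_x \frac{\partial}{\partial\theta}f(x;\theta) = \frac{d}{d\theta}\sum_x f(x;\theta) = \frac{d}{d\theta}[1] = 0$$

because every pmf sums to 1.

To establish the equivalency of (7.5) and (7.6), take another derivative, which must also be 0 since
$E[\ell'(\theta)] = 0$:

$$0 = \frac{d}{d\theta}E[\ell'(\theta)] = \frac{d}{d\theta}\sum_x \left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right] \cdot f(x;\theta) = \sum_x \frac{\partial}{\partial\theta}\left\{\left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right] \cdot f(x;\theta)\right\}$$

$$= \sum_x \left[\frac{\partial^2}{\partial\theta^2}\ln f(x;\theta)\right] \cdot f(x;\theta) + \sum_x \left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right] \cdot \frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)} \cdot f(x;\theta)$$

$$= E\left[\frac{\partial^2}{\partial\theta^2}\ln f(X;\theta)\right] + E\left[\frac{\partial}{\partial\theta}\ln f(X;\theta) \cdot \frac{\frac{\partial}{\partial\theta}f(X;\theta)}{f(X;\theta)}\right]$$

$$= -I(\theta) + E\left[\frac{\partial}{\partial\theta}\ln f(X;\theta) \cdot \frac{\partial}{\partial\theta}\ln f(X;\theta)\right] = -I(\theta) + E[\{\ell'(\theta)\}^2]$$

Therefore, $I(\theta) = E[\{\ell'(\theta)\}^2] = V(\ell'(\theta)) + \{E[\ell'(\theta)]\}^2 = V(\ell'(\theta)) + 0^2 = V(\ell'(\theta))$, completing the
proof. ■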


Example 7.33 Reconsider the single Bernoulli observation X from Example 7.31. The score
function is

$$\ell'(p) = \frac{\partial}{\partial p}\ln f(X;p) = \frac{\partial}{\partial p}[X\ln p + (1 - X)\ln(1 - p)] = \frac{X}{p} - \frac{1 - X}{1 - p} = \frac{X - p}{p(1 - p)}$$

[This expression indeed has mean zero, as indicated in the foregoing proof.] Apply (7.6):

$$I(p) = V\left[\frac{X - p}{p(1 - p)}\right] = \frac{V(X - p)}{[p(1 - p)]^2} = \frac{V(X)}{[p(1 - p)]^2} = \frac{p(1 - p)}{[p(1 - p)]^2} = \frac{1}{p(1 - p)},$$

which agrees with the result in Example 7.31. ■
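
Because a Bernoulli rv takes only the values 0 and 1, both expressions for Fisher information can be evaluated exactly by summing over the pmf. The following sketch (the function name is an illustrative assumption) computes $I(p)$ from the expected negative second derivative in (7.5) and from the variance of the score in (7.6), and also returns the mean of the score.

```python
def bernoulli_information(p):
    """Return (I(p) via (7.5), I(p) via (7.6), mean of the score) for one Bernoulli trial."""
    pmf = {0: 1 - p, 1: p}

    def score(x):
        # first derivative of x*ln(p) + (1 - x)*ln(1 - p)
        return x / p - (1 - x) / (1 - p)

    def neg_second_derivative(x):
        # negative of the second derivative of the same log-likelihood
        return x / p**2 + (1 - x) / (1 - p) ** 2

    info_75 = sum(neg_second_derivative(x) * w for x, w in pmf.items())
    score_mean = sum(score(x) * w for x, w in pmf.items())
    info_76 = sum((score(x) - score_mean) ** 2 * w for x, w in pmf.items())
    return info_75, info_76, score_mean


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        via_75, via_76, mean = bernoulli_information(p)
        print(f"p = {p}: (7.5) gives {via_75:.4f}, (7.6) gives {via_76:.4f}, "
              f"1/[p(1-p)] = {1 / (p * (1 - p)):.4f}, score mean = {mean:.1e}")
```

For each p the two computations should agree with $1/[p(1 - p)]$ up to rounding, and the score mean should be zero up to rounding.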


In principle, the same method could be applied to the pdf from Example 7.32, for which the score
function is

$$\ell'(\theta) = \frac{\partial}{\partial\theta}[\ln(\theta) + (\theta - 1)\ln(X)] = \frac{1}{\theta} + \ln(X)$$

Then Fisher information could be calculated via (7.6):

$$I(\theta) = V\left[\frac{1}{\theta} + \ln(X)\right] = V(\ln(X))$$

While the calculus to determine the variance of ln(X) is not insurmountable, the method using (7.5)
shown in Example 7.32 is computationally much easier.
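
For readers who want to see the two routes agree, here is one way to obtain $V(\ln(X))$ without direct integration. This sketch is not part of the original example; it assumes the cdf $F(x) = x^\theta$ on (0, 1), found by integrating the pdf $\theta x^{\theta-1}$, and the standard fact that an exponential rv with rate $\theta$ has variance $1/\theta^2$.

```latex
\begin{align*}
P(-\ln X > y) &= P(X < e^{-y}) = F(e^{-y}) = (e^{-y})^{\theta} = e^{-\theta y}, \qquad y > 0 \\
&\Rightarrow\; -\ln X \text{ has an exponential distribution with rate } \theta \\
I(\theta) = V(\ln X) &= V(-\ln X) = \frac{1}{\theta^{2}}
\end{align*}
```

This matches the value $I(\theta) = 1/\theta^2$ found from (7.5).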


Information in a Random Sample 

The definition of Fisher information extends to n rvs $X_1,\dots,X_n$; simply replace $f(X;\theta)$ in (7.5) or
(7.6) with the joint pmf/pdf of the $X_i$'s. When $X_1,\dots,X_n$ represent a random sample from some
distribution $f(x;\theta)$, the Fisher information in the sample can be easily computed from the information
in a single observation.


ADDITIVE PRINCIPLE OF INFORMATION  Let $X_1,\dots,X_n$ be a random sample from a distribution with pmf or
pdf $f(x;\theta)$. Then the Fisher information in $X_1,\dots,X_n$ is simply
n times the Fisher information in a single observation. That is, if
$I_n(\theta)$ denotes the Fisher information in the sample, then

$$I_n(\theta) = n \cdot I(\theta),$$

where $I(\theta)$ denotes the Fisher information in a single observation
from $f(x;\theta)$.


Proof Since the $X_i$'s form a random sample, $f(x_1,\dots,x_n;\theta) = f(x_1;\theta) \cdots f(x_n;\theta)$. The
result then follows from simple linearity properties:

$$I_n(\theta) = E\left[-\frac{\partial^2}{\partial\theta^2}\ln f(X_1,\dots,X_n;\theta)\right] = \sum_{i=1}^{n} E\left[-\frac{\partial^2}{\partial\theta^2}\ln f(X_i;\theta)\right] = n \cdot I(\theta)$$ ■

The Additive Principle of Information makes sense intuitively, because it says that twice as many
observations yield twice as much information. This property also saves us the hassle of constructing
large joint pmf/pdf expressions. The aforementioned connections between Fisher information and the
log-likelihood function still apply: with the log-likelihood $\ell(\theta) = \ln f(X_1,\dots,X_n;\theta)$ regarded as a
random variable,

$$I_n(\theta) = E[-\ell''(\theta)] = V(\ell'(\theta))$$


Example 7.34 Continuing with Example 7.31, let $X_1, X_2, \dots, X_n$ be a random sample from a
Bernoulli distribution. We saw that the information in a single observation is $I(p) = 1/[p(1 - p)]$, and
therefore the Fisher information in the random sample is $I_n(p) = nI(p) = n/[p(1 - p)]$.

Astute readers will notice that Fisher information here is exactly the reciprocal of the variance of $\hat{P}$,
which is the mle of p for Bernoulli data. As we'll see later in this section, this is not a coincidence. ■


The Cramér-Rao Inequality 

We now use the concept of Fisher information to show that if a statistic is an unbiased estimator of
$\theta$, then its minimum possible variance is the reciprocal of $I_n(\theta)$. Harald Cramér in Sweden and
C. R. Rao in India independently derived this inequality during World War II, but R. A. Fisher had
some notion of it 20 years previously.


CRAMÉR-RAO INEQUALITY  Let $X_1,\dots,X_n$ be a random sample from the distribution with pmf or pdf
$f(x;\theta)$ whose support does not depend on $\theta$. If the statistic $T = t(X_1,\dots,X_n)$
is an unbiased estimator of the parameter $\theta$, then

$$V(T) \ge \frac{1}{I_n(\theta)} = \frac{1}{n \cdot I(\theta)}$$


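Proof The clever idea here is to consider the correlation $\rho$ between T and the score function and
exploit the fact that $-1 \le \rho \le 1$. We will need the fact from an earlier proof in this section that
the mean of the score function is zero: $E[\ell'(\theta)] = 0$. Using this fact and the covariance expression
Cov(X, Y) = E(XY) - E(X)E(Y), the covariance of T and the score function $\ell'(\theta)$ is

$$\mathrm{Cov}(T, \ell'(\theta)) = E(T \cdot \ell'(\theta)) - 0 = E\left[T \cdot \frac{\partial}{\partial\theta}\ln f(X_1,\dots,X_n;\theta)\right] = E\left[T \cdot \frac{\frac{\partial}{\partial\theta}f(X_1,\dots,X_n;\theta)}{f(X_1,\dots,X_n;\theta)}\right]$$

$$= \sum_{x_1,\dots,x_n} t(x_1,\dots,x_n) \cdot \frac{\frac{\partial}{\partial\theta}f(x_1,\dots,x_n;\theta)}{f(x_1,\dots,x_n;\theta)} \cdot f(x_1,\dots,x_n;\theta) = \sum_{x_1,\dots,x_n} t(x_1,\dots,x_n) \cdot \frac{\partial}{\partial\theta}f(x_1,\dots,x_n;\theta)$$

$$= \frac{d}{d\theta}\sum_{x_1,\dots,x_n} t(x_1,\dots,x_n) \cdot f(x_1,\dots,x_n;\theta) = \frac{d}{d\theta}E(T)$$

If $T = t(X_1, X_2, \dots, X_n)$ is an unbiased estimator of $\theta$, then $E(T) = \theta$, so the derivative in the last
expression is just 1, from which we deduce that $\mathrm{Cov}(T, \ell'(\theta)) = 1$.

Now recall from Section 5.2 that the correlation between two rvs X and Y is
$\rho_{X,Y} = \mathrm{Cov}(X, Y)/(\sigma_X\sigma_Y)$. Therefore,

$$\mathrm{Cov}(X, Y)^2 = \rho_{X,Y}^2\,\sigma_X^2\sigma_Y^2 \le 1 \cdot \sigma_X^2\sigma_Y^2 = V(X)V(Y)$$

Apply this to T and the score function $\ell'(\theta)$:

$$1 = \mathrm{Cov}(T, \ell'(\theta))^2 \le V(T)V(\ell'(\theta)) = V(T) \cdot I_n(\theta),$$

and the desired inequality follows. ■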


Because the variance of T must be at least $1/I_n(\theta)$, it is natural to call T an efficient estimator of $\theta$ if

$$V(T) = 1/I_n(\theta).$$


DEFINITION Let T be an unbiased estimator of $\theta$. The efficiency of T is the ratio of the Cramér-Rao
lower bound to the variance of T:

$$\text{efficiency of } T = \frac{1/I_n(\theta)}{V(T)} = \frac{1}{V(T) \cdot I_n(\theta)}$$

T is said to be an efficient estimator of $\theta$ if T achieves the lower bound (so its
efficiency is 1; otherwise, efficiency will be less than 1). An efficient estimator
is a minimum variance unbiased estimator (MVUE), as discussed in
Section 7.1.


Example 7.35 (Example 7.34 continued) Let $X_1,\dots,X_n$ be a random sample from a Bernoulli
distribution. We saw that the Fisher information in the sample is $I_n(p) = n/[p(1 - p)]$, and therefore the
Cramér-Rao lower bound on the variance of any unbiased estimator of p is $1/I_n(p) = p(1 - p)/n$. Let
$T = \hat{P} = \sum X_i/n$, the sample proportion of successes. It was established in Example 7.4 that $\hat{P}$ is an
unbiased estimator of p and that $V(\hat{P}) = p(1 - p)/n$. Because T is unbiased and V(T) is equal to the
lower bound, T has efficiency 1 and therefore it is an efficient estimator. ■
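
The efficiency calculation in Example 7.35 can be reproduced exactly, since $n\hat{P}$ has a binomial distribution. The sketch below is illustrative only (the function name and the values n = 25, p = .3 are assumptions): it sums over the binomial pmf to get the mean and variance of the sample proportion and divides the Cramér-Rao lower bound by that variance.

```python
from math import comb


def sample_proportion_efficiency(n, p):
    """Return (mean, variance, efficiency) of the sample proportion for Bernoulli(p) data."""
    # n * P_hat ~ Bin(n, p), so enumerate the binomial pmf
    pmf = [comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)]
    mean = sum((y / n) * w for y, w in enumerate(pmf))
    variance = sum((y / n - mean) ** 2 * w for y, w in enumerate(pmf))
    lower_bound = p * (1 - p) / n  # Cramér-Rao lower bound 1 / I_n(p)
    return mean, variance, lower_bound / variance


if __name__ == "__main__":
    mean, variance, efficiency = sample_proportion_efficiency(n=25, p=0.3)
    print(f"mean = {mean:.4f}, variance = {variance:.6f}, efficiency = {efficiency:.6f}")
```

The reported efficiency should equal 1 up to floating-point rounding for any n and any p strictly between 0 and 1.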


The Cramér-Rao inequality can be generalized to an estimator whose expected value is not $\theta$ itself
but rather some function $h(\theta)$. Using a similar proof, it can be shown (under the same requirements
about the pmf/pdf) that the lower bound on the variance of any statistic T with mean $h(\theta)$ is

$$V(T) \ge \frac{[h'(\theta)]^2}{I_n(\theta)}$$

In the special case that T is unbiased for $\theta$, then $h(\theta) = \theta$, $h'(\theta) = 1$, and we have the original
Cramér-Rao inequality. (See Exercises 59-60 for applications of this more general result.)
Large-Sample Properties of the MLE


As mentioned briefly in Section 7.2, the maximum likelihood estimator $\hat{\theta}$ has some nice properties.
First of all it is consistent, which means that it converges in probability to the parameter $\theta$ as the
sample size increases. A verification of this is beyond the level of this book, but we can use it as a
basis for showing that the mle is asymptotically normal with mean $\theta$ (asymptotic unbiasedness) and
variance equal to the Cramér-Rao lower bound.


THEOREM Let $X_1,\dots,X_n$ be a random sample from a distribution whose support does
not depend on $\theta$. Then for large n the maximum likelihood estimator $\hat{\theta}$ has
approximately a normal distribution with mean $\theta$ and variance $1/[nI(\theta)]$.


A proof of this result appears in the appendix to this chapter.


Example 7.36 It was established in Example 7.16 that the mle of p when sampling from a Bernoulli
distribution is the sample proportion $\hat{P} = \sum X_i/n$. Recall from Example 7.35 that $\hat{P}$ is unbiased and
efficient with the minimum variance of the Cramér-Rao inequality. Finally, $\hat{P}$ is asymptotically
normal by the Central Limit Theorem. These properties are in accord with the asymptotic distribution
given by the theorem, $\hat{P} \mathrel{\dot\sim} N(p, 1/[nI(p)])$. ■


Example 7.37 (Example 7.32 continued) Consider a random sample $X_1,\dots,X_n$ from the distribution
with pdf $f(x;\theta) = \theta x^{\theta-1}$ for $0 \le x \le 1$. The Fisher information in a single observation was found to
be $I(\theta) = 1/\theta^2$. The maximum likelihood estimator of $\theta$ (see Exercise 27 for a similar example) is

$$\hat{\theta} = \frac{-1}{\sum \ln(X_i)/n} \tag{7.7}$$

The expected value of ln(X) for this distribution is $-1/\theta$, so the denominator of (7.7) converges in
probability to $-1/\theta$ by the Law of Large Numbers. Therefore $\hat{\theta}$ converges in probability to $\theta$, which
means that $\hat{\theta}$ is consistent. (We knew this because the mle is always consistent, but it is also nice to
show it directly.) Determining the exact distribution of $\hat{\theta}$ is quite difficult. However, by the preceding
theorem, for large n the distribution of $\hat{\theta}$ is approximately normal, with mean $\theta$ and variance
$1/[nI(\theta)] = \theta^2/n$. ■
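
The large-sample claim in Example 7.37 can be explored by simulation. The sketch below is an illustration under stated assumptions: it generates data by the inverse-cdf method, $X = U^{1/\theta}$ with U uniform on (0, 1], which relies on the cdf $F(x) = x^\theta$; the function name, seed, and the values $\theta = 3$, n = 200 are arbitrary choices.

```python
import math
import random
import statistics


def simulate_mle(theta=3.0, n=200, reps=4000, seed=11):
    """Simulate the mle -n / sum(ln X_i) for the pdf theta * x**(theta - 1) on [0, 1].

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # ln X = ln(U) / theta when X = U**(1/theta); 1 - random() lies in (0, 1]
        log_sum = sum(math.log(1.0 - rng.random()) / theta for _ in range(n))
        estimates.append(-n / log_sum)
    return {
        "mean": statistics.fmean(estimates),
        "variance": statistics.pvariance(estimates),
        "asymptotic_variance": theta**2 / n,
    }


if __name__ == "__main__":
    for name, value in simulate_mle().items():
        print(f"{name}: {value:.4f}")
```

With n this large, the simulated mean should be close to $\theta$ and the simulated variance close to $\theta^2/n$; a histogram of the estimates (not drawn here) would be expected to look roughly normal.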


Sufficiency and Efficiency 
As we discussed in Section 7.3, the Rao-Blackwell Theorem implies that any estimator not based
purely on sufficient statistics is necessarily inferior to (has greater variance than) some other statistic. So,
we cannot expect a statistic to be an efficient estimator without first being sufficient.

The proof of the Cramér-Rao inequality considered the correlation between two random variables:
a statistic T and the score function $\ell'(\theta)$. The inequality followed from the fact that $|\rho| \le 1$, but we
know from Chapter 5 that equality can only occur (that is, $|\rho| = 1$) if the two rvs are linear
functions of each other. So, suppose a statistic $T = t(X_1,\dots,X_n)$ is sufficient for $\theta$. By the
Factorization Theorem, the score function may then be written as

$$\ell'(\theta) = \frac{\partial}{\partial\theta}\ln f(x_1,\dots,x_n;\theta) = \frac{\partial}{\partial\theta}\ln[g(t;\theta) \cdot h(x_1,\dots,x_n)]$$

$$= \frac{\partial}{\partial\theta}\ln[g(t;\theta)] + \frac{\partial}{\partial\theta}\ln[h(x_1,\dots,x_n)] = \frac{\partial}{\partial\theta}\ln[g(t(x_1,\dots,x_n);\theta)]$$

Therefore, an estimator can only be efficient if it is a linear function of $\frac{\partial}{\partial\theta}\ln[g(t(X_1,\dots,X_n);\theta)]$. In
particular, an efficient estimator can only depend on $X_1,\dots,X_n$ through the sufficient statistic
$T = t(X_1,\dots,X_n)$. This result is consistent with the Rao-Blackwell Theorem.


Exercises: Section 7.4 (51-60) 


51. The number of attempts required to successfully transmit a message across a noisy
channel can be modeled by a geometric distribution, whose pmf is $(1 - p)^{x-1}p$ for
x = 1, 2, 3, .... To estimate the unknown parameter p we obtain the random sample
$X_1, X_2, \dots, X_n$ from this geometric distribution.

a. Find the Fisher information in a single observation X using both (7.5) and (7.6).

b. What is the Fisher information in the random sample?

c. Determine the Cramér-Rao lower bound for the variance of an unbiased estimator
of p.

52. Assume that the number of alpha particles emitted in one second by a particular
radioactive source has a Poisson distribution with parameter $\mu$. Consider estimating
$\mu$ based on a random sample $X_1, X_2, \dots, X_n$.

a. Find the Fisher information in a single observation using both (7.5) and (7.6).

b. Find the Cramér-Rao lower bound for the variance of an unbiased estimator of $\mu$.

c. Determine the mle of $\mu$ and show that the mle is an efficient estimator.

d. Is the asymptotic distribution of the mle in accord with the last theorem of this
section? Explain.

53. Let $X_1,\dots,X_n$ be a random sample from the Uniform[0, $\theta$] distribution.

a. Use the expression $I(\theta) = E[(\ell'(\theta))^2]$ to determine the Fisher information in a
single observation from this distribution.

b. Find the Cramér-Rao lower bound for the variance of an unbiased estimator of $\theta$.

c. In Examples 7.9 and 7.10, two unbiased estimators for $\theta$ were proposed, one with
variance $\theta^2/[n(n + 2)]$ and another with variance $\theta^2/(3n)$. Compare these variances
to part (b) and explain why they seem to contradict the Cramér-Rao
inequality. What assumption is violated, causing the inequality not to apply here?

54. Survival times have the exponential distribution with pdf $f(x;\lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$,
where $\lambda > 0$ is unknown. However, we wish to estimate the mean $\mu = 1/\lambda$ based on
the random sample $X_1, X_2, \dots, X_n$, so let's re-express the pdf in the form $(1/\mu)e^{-x/\mu}$.

a. Find the information in a single observation and the Cramér-Rao lower bound.

b. Determine the mle of $\mu$.

c. Find the mean and variance of the mle.

d. Is the mle an efficient estimator? Explain.

55. Let $X_1, X_2, \dots, X_n$ be a random sample from the normal distribution with known
standard deviation $\sigma$.

a. Find the mle of $\mu$.

b. Find the distribution of the mle.

c. Is the mle an efficient estimator? Explain.

d. How does the answer to part (b) compare with the asymptotic distribution
given by the second theorem?

56. Let $X_1, X_2, \dots, X_n$ be a random sample from the normal distribution with known mean $\mu$
but with the variance $\sigma^2$ as the unknown parameter.

a. Find the Fisher information for $\sigma^2$ in a single observation and the Cramér-Rao
lower bound.

b. Find the mle of $\sigma^2$.

c. Find the distribution of the mle.

d. Is the mle an efficient estimator? Explain.

e. Is the answer to part (c) in conflict with the asymptotic distribution of the mle
given by the second theorem? Explain.

57. Let $X_1, X_2, \dots, X_n$ be a random sample from the normal distribution with known mean
but with the standard deviation $\sigma$ as the unknown parameter.

a. Find the Fisher information for $\sigma$ in a single observation.

b. Compare the answer in part (a) to the answer in Exercise 56(a). Does the
information depend on the parameterization?

58. Let $X_1, X_2, \dots, X_n$ be a random sample from a continuous distribution with pdf $f(x;\theta)$.
For large n, the variance of the sample median is approximately $1/\{4n[f(\tilde{\mu};\theta)]^2\}$.
If $X_1, X_2, \dots, X_n$ is a random sample from the normal distribution with known
standard deviation $\sigma$ and unknown $\mu$, determine the efficiency of the sample median.

59. Return to the geometric distribution from Exercise 51. Let $X_1,\dots,X_n$ be a random
sample from this distribution, and let $\bar{X}$ denote the sample mean.

a. Determine the expected value and variance of $\bar{X}$ as functions of p.

b. Using the generalization of the Cramér-Rao inequality presented in this section,
determine the lower bound for the variance of any estimator whose expectation
is equal to $E(\bar{X})$ from part (a).

c. Is $\bar{X}$ an efficient estimator of its expectation?

60. Return to the exponential distribution from Exercise 54. Let $X_1,\dots,X_n$ be a random
sample from this distribution, and let $\bar{X}$ denote the sample mean.

a. Find the Fisher information for $\lambda$ in a single observation from this distribution.

b. Determine the expected value and variance of $\bar{X}$ as functions of $\lambda$.

c. Using the generalization of the Cramér-Rao inequality presented in this section,
determine the lower bound for the variance of any estimator whose expectation
is equal to $E(\bar{X})$ from part (b). Is $\bar{X}$ an efficient estimator of its expectation?

d. Does it follow that $1/\bar{X}$ is the MVUE of $\lambda$? Why or why not?
Supplementary Exercises: (61—78) 


61. At time t = 0, there is one individual alive in a certain population. A pure birth
process then unfolds as follows. The time until the first birth is exponentially distributed
with parameter $\lambda$. After the first birth, there are two individuals alive. The time until the
first gives birth again is exponential with parameter $\lambda$, and similarly for the second
individual. Therefore, the time until the next birth is the minimum of two exponential ($\lambda$)
variables, which is exponential with parameter $2\lambda$. Similarly, once the second
birth has occurred, there are three individuals alive, so the time until the next birth is
an exponential rv with parameter $3\lambda$, and so on (the memoryless property of the
exponential distribution is being used here). Suppose the process is observed until the
sixth birth has occurred and the successive birth times are 25.2, 41.7, 51.2, 55.5, 59.5,
61.8 (from which you should calculate the times between successive births). Derive the
mle of $\lambda$. [Hint: The likelihood is a product of exponential terms.]

62. Let $X_1,\dots,X_n$ be a random sample from a Uniform[0, $\theta$] distribution, and let $Y_n$
denote the largest observation: $Y_n = \max(X_1,\dots,X_n)$. [This rv was used as an
estimator of $\theta$ in several examples in Section 7.1.]

a. Show that the pdf of $Y_n$ is

$$f(y) = \frac{n y^{n-1}}{\theta^n} \qquad 0 \le y \le \theta$$

[Hint: Use the methods of Section 5.7, or use the relationship
$F(y) = P(Y_n \le y) = P(X_1 \le y \cap \cdots \cap X_n \le y)$.]

b. Use part (a) to determine the mean and variance of $Y_n$.

63. The proportion of iron in rock specimens from a certain quarry is assumed to follow a
standard beta distribution with unknown parameters $\alpha$ and $\beta$. Suppose the following
observations are made on a sample of n = 6 specimens: .873, .437, .249, .712, .501,
.618. Calculate the method of moments estimates for $\alpha$ and $\beta$. [Hint: Be careful in
determining the formula for $E(X^2)$.]

64. Let $X_1,\dots,X_n$ be a random sample from a uniform distribution on the interval $[-\theta, \theta]$.

a. Determine the mle of $\theta$. [Hint: Look back at what we did in Example 7.23.]

b. Give an intuitive argument for why the 
mle is either biased or unbiased. 

c. Determine a sufficient statistic for 0. 
[Hint: See Example 7.27.] 

d. Use the results of Section 5.7 to deter- 
mine the joint pdf of the smallest order 
statistic Y, and the largest order statistic 
Y,. Then use it to obtain the expected 
value of the mle. [Hint: Draw the region 
of joint positive density for Y, and Y,, 
and identify what the mle is for each 
part of this region.] 

e. What is an unbiased estimator for 0? 


Carry out the details for minimizing MSE 
in Example 7.8: show that c= I/(n + 1) 


minimizes the MSE of 6? = c )7(X; — X)” 


when the population § distribution is 
normal. 
Let X;, ..., X, be a random sample from a 


pdf that is symmetric about “. An estimator 
for 4 that has been found to perform well 
for a variety of underlying distributions is 
the Hodges—Lehmann estimator. To define 


67. 


68. 


69. 


70. 
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it, first compute for each i < j and each 
j=1, 2, ..., nm the pairwise average 
Xi; = (X;+-X;)/2. Then the estimator is 
ji = the median of the X;;’s. Compute the 
value of this estimate using the data of 
Exercise 53 of Chapter 1. [Hint: Construct a 
square table with the ~x,’s listed on the left 
margin and on top. Then compute averages 
on and above the diagonal.] 


For a normal population distribution, the 
Statistic 


& = median(|X; — X|,..., |X, — X|)/.6745 


can be used to estimate o. This estimator is 
more resistant to the effects of outliers than 
is the sample standard deviation. Compute 
both the corresponding point estimate and 
s for the data of Example 7.2. 


When the sample standard deviation S' is 
based on a random sample from a normal 
population distribution, it can be shown that 


E(S) = V2/(n— IV (n/2)o/T[(n — 1)/2] 


Use this to obtain an unbiased estimator for 
o of the form cS. What is c when n = 20? 


Each of n specimens is to be weighed twice 
on the same scale. Let X; and Y; denote the 
two observed weights for the ith specimen. 
Suppose X; and Y; are independent of each 
other, each normally distributed with mean 
value ju; (the true weight of specimen 7) and 
variance 0°. 


a. Show that the mle of oa’ is 
62 = 37 (X;—Y;)°/(4n). (Hint: If 
Z=(zi+z)/2, then S\(z;—- z= 
(a1 — z2)°/2.] 


b. Is the mle G? an unbiased estimator of 
o°? Find an unbiased estimator of o”. 
[Hint: For any rv Z, E(Z’) =V(Z) + 
[E(Z)|°. Apply this to Z = X; — Y;.] 


For 0 <6 <1 consider a random sample 
from a uniform distribution on the interval 
from 0 to 1/0. Identify a sufficient statistic 
for 0. 




71. Let $p$ denote the proportion of all individuals who are allergic to a
particular medication. An investigator tests individual after individual to
obtain a group of $r$ individuals who have the allergy. Let $X_i = 1$ if the
$i$th individual tested has the allergy and $X_i = 0$ otherwise
($i = 1, 2, 3, \ldots$). Recall that in this situation, $Y$ = the number of
individuals tested to obtain the desired group has a negative binomial
distribution. Use the definition of sufficiency to show that $Y$ is a
sufficient statistic for $p$.

72. The fraction of a bottle that is filled with a particular liquid is a
continuous random variable $X$ with pdf $f(x; \theta) = \theta x^{\theta - 1}$
for $0 < x < 1$ (where $\theta > 0$).

a. Obtain the method of moments estimator for $\theta$.

b. Is the estimator of (a) a sufficient statistic? If not, what is a sufficient
statistic, and what is an estimator of $\theta$ (not necessarily unbiased)
based on a sufficient statistic?

73. Let $X_1, \ldots, X_n$ be a random sample from a normal distribution with
both $\mu$ and $\sigma$ unknown. An unbiased estimator of
$\theta = P(X \le c)$ based on the jointly sufficient statistics is desired.
Let $k = \sqrt{n}/(n-1)$ and $w = (c - \bar{x})/s$. Then it can be shown that
the minimum variance unbiased estimator for $\theta$ is

$$\hat{\theta} = \begin{cases} 0 & kw \le -1 \\ P\left(T < \dfrac{kw\sqrt{n-2}}{\sqrt{1 - k^2w^2}}\right) & -1 < kw < 1 \\ 1 & kw \ge 1 \end{cases}$$

where $T$ has a $t$ distribution with $n - 2$ df. The article "Big and Bad: How
the S.U.V. Ran over Automobile Safety" (The New Yorker, Jan. 24, 2004) reported
that when an engineer with Consumers Union (the product testing and rating
organization that publishes Consumer Reports) performed three different trials
in which a Chevrolet Blazer was accelerated to 60 mph and then suddenly braked,
the stopping distances (ft) were 146.2, 151.6, and 153.4, respectively.
Assuming that braking distance is normally distributed, obtain the minimum
variance unbiased estimate for the probability that distance is at most 150 ft,
and compare to the maximum likelihood estimate of this probability.

74. Here is a result that allows for easy identification of a minimal
sufficient statistic: Suppose there is a function $t(x_1, \ldots, x_n)$ such
that for any two sets of observations $x_1, \ldots, x_n$ and
$y_1, \ldots, y_n$, the likelihood ratio
$f(x_1, \ldots, x_n; \theta)/f(y_1, \ldots, y_n; \theta)$ doesn't depend on
$\theta$ if and only if $t(x_1, \ldots, x_n) = t(y_1, \ldots, y_n)$. Then
$T = t(X_1, \ldots, X_n)$ is a minimal sufficient statistic. The result is also
valid if $\theta$ is replaced by $\theta_1, \ldots, \theta_k$, in which case
there will typically be several jointly minimal sufficient statistics. For
example, if the underlying pdf is exponential with parameter $\lambda$, then
the likelihood ratio is $e^{-\lambda(\sum x_i - \sum y_i)}$, which will not
depend on $\lambda$ if and only if $\sum x_i = \sum y_i$, so $T = \sum X_i$ is a
minimal sufficient statistic for $\lambda$ (and so is the sample mean).

a. Identify a minimal sufficient statistic when the $X_i$'s are a random sample
from a Poisson distribution.

b. Identify a minimal sufficient statistic or jointly minimal sufficient
statistics when the $X_i$'s are a random sample from a normal distribution with
mean $\theta$ and variance $\theta$.

c. Identify a minimal sufficient statistic or jointly minimal sufficient
statistics when the $X_i$'s are a random sample from a normal distribution with
mean $\theta$ and standard deviation $\theta$.

75. The principle of unbiased estimation has been criticized on the grounds
that in some situations the only unbiased estimator is patently ridiculous.
Here is one such example. Suppose that the number of blemishes $X$ on a
randomly selected piece of fruit has a Poisson distribution with parameter
$\mu$. You are going to purchase two such pieces of fruit and wish to estimate
$\theta = e^{-2\mu}$, the probability that neither of these has any blemishes.
But your estimate is based on observing the value of $X$ for a single piece.
Obtain an estimator $\hat{\theta} = d(X)$ that is unbiased for $\theta$; i.e.,
such that $E[d(X)] = e^{-2\mu}$. [Hint: Set the summation for $E[d(X)]$ equal
to $e^{-2\mu}$, cancel $e^{-\mu}$ from both sides, then expand what remains on
the right-hand side in a Taylor series and compare the two sides to determine
$d(X)$.] If $X = 200$, what is the estimate? Does this seem reasonable? What is
the estimate if $X = 199$? Is this reasonable?

76. Let $X$, the payoff from playing a certain game, have pmf

$$p(x; \theta) = \begin{cases} \theta & x = -1 \\ (1 - \theta)^2 \theta^x & x = 0, 1, 2, \ldots \end{cases}$$

a. Verify that $p(x; \theta)$ is a legitimate pmf, and determine the expected
payoff. [Hint: Look back at how the properties of a geometric random variable
were developed in Chapter 3.]

b. Let $X_1, \ldots, X_n$ be the payoffs from $n$ independent games of this
type. Determine the mle of $\theta$. [Hint: Let $Y$ denote the number of
observations among the $n$ that equal $-1$; that is,
$Y = \sum I(X_i = -1)$, where $I(A) = 1$ if $A$ occurs and 0 otherwise. Then,
write the likelihood as a single expression in terms of $\sum x_i$ and $y$.]

c. What is the approximate variance of the mle when $n$ is large?

77. Regression through the origin. Let $x$ denote the number of items in an
order and $y$ denote time (min) necessary to process the order. Processing time
may be determined by various factors other than order size. So for any
particular value of $x$, we now regard the value of total production time as a
random variable $Y$. Consider the following data obtained by specifying various
values of $x$ and determining total production time for each one.

x     10    15    18    20    25    27    30    35    36    40
y    301   455   533   599   750   810   903  1054  1088  1196

a. Plot the observed $(x, y)$ pairs on a two-dimensional coordinate system. Do
all points fall exactly on a line passing through (0, 0)? Do the points tend to
fall close to such a line?

b. Consider the following probability model for the data. Values
$x_1, x_2, \ldots, x_n$ are specified, and at each $x_i$ we will observe a
value of the dependent variable $Y_i$. Assume that the $Y_1, \ldots, Y_n$ are
independent and normally distributed, with $Y_i$ having mean value
$\beta x_i$ and variance $\sigma^2$. That is, rather than assume that
$y = \beta x$, a linear function of $x$ passing through the origin, we are
assuming that the mean value of $Y$ is a linear function of $x$ and that the
variance of $Y$ is the same for any particular $x$ value. Obtain formulas for
the maximum likelihood estimates of $\beta$ and $\sigma^2$, and then calculate
the estimates for the given data. How would you interpret the estimate of
$\beta$? What value of processing time would you predict when $x = 25$? [Hint:
The likelihood is a product of individual normal pdfs with different mean
values and the same variance. Proceed as in the estimation via maximum
likelihood of the parameters $\mu$ and $\sigma^2$ based on a random sample from
a normal population distribution.]


78. Reconsider the "regression through the origin" situation presented in the
previous exercise. Consider the following three estimators for the slope
parameter $\beta$ (one of which is the mle obtained in the previous exercise):

$$\hat{\beta}_1 = \frac{1}{n}\sum \frac{Y_i}{x_i} \qquad \hat{\beta}_2 = \frac{\sum Y_i}{\sum x_i} \qquad \hat{\beta}_3 = \frac{\sum x_i Y_i}{\sum x_i^2}$$

a. Show that all three of these estimators are unbiased for $\beta$.

b. Determine the variance of all three estimators, and comment on what you
find.


Proof of the Asymptotic Distribution of the MLE 
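Let $\hat{\theta}$ denote the mle of $\theta$, and consider again the score
function $\ell'(\theta) = \frac{\partial}{\partial\theta} \ln f(X_1, X_2, \ldots, X_n; \theta)$.
Its derivative $\ell''(\theta)$ at the true parameter value $\theta$ is
approximately equal to the following difference quotient:

$$\ell''(\theta) \approx \frac{\ell'(\hat{\theta}) - \ell'(\theta)}{\hat{\theta} - \theta} \qquad (7.8)$$

Moreover, the error in Equation (7.8), i.e., the difference between the two
sides of the $\approx$ sign, approaches zero as $n \to \infty$ because
$\hat{\theta}$ approaches $\theta$ (consistency). Now, because $\hat{\theta}$
is the mle, by definition $\ell'(\hat{\theta}) = 0$, and (7.8) can be
re-arranged to write

$$\sqrt{n}(\hat{\theta} - \theta) \approx \frac{\frac{1}{\sqrt{n}}\,\ell'(\theta)}{-\frac{1}{n}\,\ell''(\theta)} \qquad (7.9)$$

Similar to the proof of the additive principle of information, the denominator
may be written as

$$-\frac{1}{n}\,\ell''(\theta) = \frac{1}{n}\left[\left(-\frac{\partial^2}{\partial\theta^2} \ln f(X_1; \theta)\right) + \cdots + \left(-\frac{\partial^2}{\partial\theta^2} \ln f(X_n; \theta)\right)\right]$$

the average of $n$ iid random variables each with mean $I(\theta)$. Therefore,
by the Law of Large Numbers, the denominator converges to $I(\theta)$. At the
same time, the numerator of (7.9) is

$$\frac{1}{\sqrt{n}}\,\ell'(\theta) = \frac{1}{\sqrt{n}}\left[\left(\frac{\partial}{\partial\theta} \ln f(X_1; \theta)\right) + \cdots + \left(\frac{\partial}{\partial\theta} \ln f(X_n; \theta)\right)\right]$$

The terms in parentheses are also iid, each with mean 0 (the mean of the score
function is zero) and variance $I(\theta)$ by (7.6). It follows from the
Central Limit Theorem that the numerator converges to a normal rv with mean 0
and standard deviation $\sqrt{I(\theta)}$.

Combining these two results, the ratio on the right-hand side of (7.9) is
approximately normal with mean 0 and standard deviation
$\sqrt{I(\theta)}/I(\theta) = 1/\sqrt{I(\theta)}$. That is,
$\sqrt{n}(\hat{\theta} - \theta)$ is approximately $N(0, 1/\sqrt{I(\theta)})$,
and it follows that $\hat{\theta}$ is approximately normal with mean $\theta$
and variance $1/[nI(\theta)]$, the Cramér–Rao lower bound. ■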
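The conclusion of this proof can be checked numerically. The following is a minimal simulation sketch, not part of the proof itself. It assumes an exponential population with rate $\lambda$, for which the mle is $1/\bar{X}$ and the Fisher information in one observation is $I(\lambda) = 1/\lambda^2$; the function name `simulate_scaled_mle_errors` and the values of `lam`, `n`, and `reps` are hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def simulate_scaled_mle_errors(lam, n, reps, seed=0):
    """Return reps simulated values of sqrt(n) * (lambda_hat - lam).

    Each value comes from one exponential(lam) sample of size n,
    where lambda_hat = 1 / (sample mean) is the mle of the rate.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        total = sum(rng.expovariate(lam) for _ in range(n))
        lam_hat = n / total  # mle of lambda: reciprocal of the sample mean
        draws.append(math.sqrt(n) * (lam_hat - lam))
    return draws


# Illustrative (assumed) settings, not taken from the text.
lam, n, reps = 2.0, 200, 4000
draws = simulate_scaled_mle_errors(lam, n, reps)

fisher_info = 1.0 / lam ** 2  # I(lambda) for a single exponential observation
print("simulated mean:", statistics.mean(draws))
print("simulated sd:  ", statistics.stdev(draws))
print("1/sqrt(I):     ", 1.0 / math.sqrt(fisher_info))
```

With a large sample size the simulated standard deviation should be close to $1/\sqrt{I(\lambda)} = \lambda$, and the simulated mean close to 0, in line with the approximate $N(0, 1/\sqrt{I(\theta)})$ distribution derived above.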




Introduction 

A point estimate, because it is a single number, by itself provides no information about the precision
and reliability of estimation. Consider, for example, using the statistic $\bar{X}$ to calculate a point estimate
for the true average breaking strength of a certain brand of paper towels, and suppose that $\bar{x} = 9322.7$
grams. Because of sampling variability, it is virtually never the case that $\bar{x} = \mu$. The point estimate
says nothing about how close it might be to $\mu$. An alternative to reporting a single sensible value for
the parameter being estimated is to calculate and report an entire interval of plausible values: an
interval estimate or confidence interval (CI).

A confidence interval is calculated by first selecting a confidence level, which is a measure of the
degree of reliability of the interval. A confidence interval with a 95% confidence level for the true
average breaking strength might have a lower limit of 9162.5 and an upper limit of 9482.9. Then at
the 95% confidence level, any value of $\mu$ between 9162.5 and 9482.9 g is plausible. The higher the
confidence level, the more strongly we believe that the value of the parameter being estimated lies
within the interval (an interpretation of any particular confidence level will be given shortly).

Information about the precision of an interval estimate is conveyed by the width of the interval. If
the confidence level is high and the resulting interval is quite narrow, our knowledge of the value of
the parameter is reasonably precise. A very wide confidence interval, however, gives the message that
there is a great deal of uncertainty concerning the value of what we are estimating. Figure 8.1 shows
95% confidence intervals for true average breaking strengths of two different brands of paper towels.
One of these intervals suggests precise knowledge about $\mu$, whereas the other suggests a very wide
range of plausible values.


Figure 8.1 Confidence intervals indicating precise (Brand 1) and imprecise (Brand 2) information about $\mu$ (horizontal axis: Strength)

8 Statistical Intervals Based on a Single Sample
8.1 Basic Properties of Confidence Intervals 


The basic concepts and properties of confidence intervals (CIs) are most easily introduced by first
focusing on a simple, albeit somewhat unrealistic, problem situation. Suppose that the parameter of
interest is a population mean $\mu$ and that


1. The population distribution is normal.
2. The value of the population standard deviation $\sigma$ is known.


Population normality is often a reasonable assumption and can be checked by examining a normal
probability plot of the sample data. However, if the value of $\mu$ is unknown, it is unlikely that the value
of $\sigma$ would be available (knowledge of a population's center typically precedes information
concerning spread). In later sections, we will develop methods based on less restrictive assumptions.


Example 8.1 Titanium alloys are used in everything from offshore oil operations to toys (remember
the fidget spinner?). The article "Statistical Analysis of Tensile Strength and Elongation of Pulse TIG
Welded Titanium Alloy Joints Using Weibull Distribution" (Cogent Engr. 2016) described an
experiment designed to study various characteristics of a certain type of weld. A total of $n = 31$
experimental runs resulted in a sample mean tensile strength of $\bar{x} = 1064$ MPa, and the data suggests
that tensile strength measurements can be modeled with a normal distribution (despite the Weibull
reference in the article's title!). Assuming the population standard deviation for tensile strength of these
welds is $\sigma = 55$ MPa (a value suggested by data in the article), we will see shortly how to obtain an
interval of plausible values for $\mu$, the true average tensile strength of all such titanium alloy welds.


The actual sample observations $x_1, x_2, \ldots, x_n$ are assumed to be the result of a random sample
$X_1, \ldots, X_n$ from a $N(\mu, \sigma)$ distribution. The results of Chapter 6 then imply that the sample mean
$\bar{X}$ is normally distributed, with expected value $\mu$ and standard deviation $\sigma/\sqrt{n}$. Standardizing
$\bar{X}$ by first subtracting its expected value and then dividing by its standard deviation yields the
variable

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (8.1)$$

Then $Z$ has a standard normal distribution. Because the area under the standard normal curve between
$-1.96$ and $1.96$ is .95,

$$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .95 \qquad (8.2)$$

The next step in the development of our CI is to manipulate the inequalities inside the parentheses in
(8.2) so that they appear in the equivalent form $l < \mu < u$, where the endpoints $l$ and $u$ involve $\bar{X}$ and
$\sigma/\sqrt{n}$. Multiplying all terms in the inequalities by $\sigma/\sqrt{n}$, subtracting $\bar{X}$ from each term, and then
multiplying through by $-1$ (to eliminate the negative sign in front of $\mu$) gives

$$\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}$$

These endpoints also result from replacing each < by = in (8.2) and solving for $\mu$.
Because this last set of inequalities is equivalent to those inside (8.2), it follows that

$$P\left(\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = .95 \qquad (8.3)$$

The event inside the parentheses in (8.3) has a somewhat unfamiliar appearance. Previously, the
random quantity has appeared in the middle with constants on both ends, as in $a < Y < b$. But in
(8.3) the random quantity appears on the two ends and the unknown constant $\mu$ appears in the middle.
To interpret (8.3), think of a random interval having left endpoint $\bar{X} - 1.96 \cdot \sigma/\sqrt{n}$ and right endpoint
$\bar{X} + 1.96 \cdot \sigma/\sqrt{n}$, which in interval notation is

$$\left(\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) \qquad (8.4)$$

The interval (8.4) is random because the two endpoints of the interval involve a random variable.
Note that the interval is centered at the sample mean $\bar{X}$ and extends $1.96 \cdot \sigma/\sqrt{n}$ to each side of $\bar{X}$.
Thus the interval's width is $2 \cdot 1.96 \cdot \sigma/\sqrt{n}$, which is not random; only the location of the interval, its
midpoint $\bar{X}$, is random (see Figure 8.2). Now (8.3) can be paraphrased as "the probability is .95 that
the random interval (8.4) includes or covers the true value of $\mu$." Before any experiment is performed
and any data is gathered, it is quite likely (probability .95) that $\mu$ will lie inside the interval in
Expression (8.4).


Figure 8.2 The random interval (8.4) centered at $\bar{X}$, extending $1.96\sigma/\sqrt{n}$ to each side from $\bar{X} - 1.96\sigma/\sqrt{n}$ to $\bar{X} + 1.96\sigma/\sqrt{n}$


DEFINITION If after observing $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, we compute the observed sample
mean $\bar{x}$ and then substitute $\bar{x}$ into (8.4) in place of $\bar{X}$, the resulting fixed interval is
called a 95% confidence interval for $\mu$. This CI can be expressed either as

$$\left(\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) \text{ is a 95\% confidence interval for } \mu$$

or as

$$\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}} \text{ with 95\% confidence}$$

A concise expression for the interval is $\bar{x} \pm 1.96 \cdot \sigma/\sqrt{n}$, where $-$ gives the left
endpoint (lower limit) and $+$ gives the right endpoint (upper limit).


Example 8.2 (Example 8.1 Continued) The quantities needed for computation of the 95% CI for
true average tensile strength are $\sigma = 55$, $n = 31$, and $\bar{x} = 1064$. The resulting interval is

$$\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}} = 1064 \pm 1.96 \cdot \frac{55}{\sqrt{31}} = 1064 \pm 19.4 = (1044.6, 1083.4)$$

We infer at the 95% confidence level that $1044.6 < \mu < 1083.4$. That is, with a high degree of
certainty, the data indicates that the true mean tensile strength for this type of titanium alloy weld is
between 1044.6 and 1083.4 MPa. ■




Interpreting a Confidence Level 

The confidence level 95% for the interval just defined was inherited from the probability .95 for the
random interval (8.4). Intervals having other levels of confidence will be introduced shortly. For now,
though, consider how 95% confidence can be interpreted.

We started with an event whose probability was .95 (that the random interval (8.4) would capture
the true value of $\mu$) and then used the data in Example 8.1 to compute the CI (1044.6, 1083.4). It's
therefore tempting to conclude that $\mu$ is between 1044.6 and 1083.4 with probability .95. But by
substituting $\bar{x} = 1064$ for $\bar{X}$, all randomness disappears; the interval (1044.6, 1083.4) is not random,
and neither is $\mu$ (while its value is unfortunately unknown to us, $\mu$ is still a constant). Thus it is
incorrect to write $P(\mu \text{ lies in } (1044.6, 1083.4)) = .95$.

A correct interpretation of "95% confidence" relies on the long-run relative frequency
interpretation of probability. To say that an event $A$ has probability .95 is to say that if the experiment on
which $A$ is defined is performed over and over again, in the long run $A$ will occur 95% of the time.
Suppose we obtain another sample of tensile strength values and compute another 95% interval. Then
we consider repeating this for a third sample, a fourth sample, and so on. Let $A$ be the event that
$\bar{X} - 1.96 \cdot \sigma/\sqrt{n} < \mu < \bar{X} + 1.96 \cdot \sigma/\sqrt{n}$. Since $P(A) = .95$, in the long run 95% of our computed CIs
will contain $\mu$. This is illustrated in Figure 8.3, where the vertical line cuts the measurement axis at
the true (but unknown) value of $\mu$. Notice that of the 11 intervals pictured, only intervals 3 and 11 fail
to contain $\mu$. In the long run, only 5% of all intervals so constructed would fail to contain $\mu$.


Figure 8.3 Repeated construction of 95% CIs (intervals numbered 1 through 11, plotted against a vertical line at the true value of $\mu$)
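The long-run interpretation can be illustrated with a short simulation. This is only a sketch under stated assumptions: the "true" mean used here (1064) is a hypothetical value chosen for the simulation, since in practice $\mu$ is unknown, and the function name `coverage_rate` and the number of repetitions are likewise illustrative. The values $\sigma = 55$ and $n = 31$ are borrowed from Example 8.1.

```python
import math
import random


def coverage_rate(mu, sigma, n, reps, z=1.96, seed=1):
    """Fraction of simulated z intervals xbar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Draw one normal sample of size n and compute its mean.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps


# mu = 1064 is an assumed "true" value used only to run the simulation.
rate = coverage_rate(mu=1064.0, sigma=55.0, n=31, reps=10000)
print("proportion of intervals containing mu:", rate)
```

The printed proportion should be close to .95. Each individual interval either contains $\mu$ or does not; the 95% describes the procedure, not any single computed interval.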


According to this interpretation, the confidence level 95% is not so much a statement about any
particular interval such as (1044.6, 1083.4), but pertains to what would happen if a very large number
of intervals were constructed using the same formula. Although this may seem unsatisfactory, the root
of the difficulty lies with our interpretation of probability: it applies to a long sequence of
replications of an experiment, rather than just a single replication. There is another approach to the
construction and interpretation of CIs that uses the notion of subjective probability and Bayes'
theorem, as discussed in Chapter 15. The interval presented here (as well as each interval presented
subsequently) is called a "classical" CI because its interpretation rests on the classical notion of
probability (although the main ideas were developed as recently as the 1930s).


Other Levels of Confidence 
The confidence level of 95% was inherited from the probability .95 for the initial inequalities in (8.2). 
If a confidence level of 99% is desired, the initial probability of .95 must be replaced by .99, which 
necessitates changing the z critical value in (8.2) from 1.96 to 2.576. A 99% CI then results from 
using 2.576 in place of 1.96 in the formula for the 95% CI. 

This suggests that any desired level of confidence can be achieved by replacing 1.96 or 2.576 with 
the appropriate standard normal critical value. As Figure 8.4 shows, a probability of $1 - \alpha$ is
achieved by using $z_{\alpha/2}$, which captures upper-tail area $\alpha/2$, in place of 1.96.


Figure 8.4 $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$ (z curve; shaded tail area = $\alpha/2$, with $-z_{\alpha/2}$, 0, and $z_{\alpha/2}$ marked on the axis)


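DEFINITION A $100(1 - \alpha)\%$ confidence interval for the mean $\mu$ of a normal population
when the value of $\sigma$ is known is given by

$$\left(\bar{x} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right) \qquad (8.5)$$

or, equivalently, by $\bar{x} \pm z_{\alpha/2} \cdot \sigma/\sqrt{n}$.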


The z critical values for the most commonly used confidence levels are displayed in Table 8.1.


Table 8.1 Values of $z_{\alpha/2}$ for 90, 95, and 99% confidence

Confidence level (%)    $\alpha$    $\alpha/2$    $z_{\alpha/2}$
90                      .10         .05           1.645
95                      .05         .025          1.960
99                      .01         .005          2.576


Example 8.3 An introductory course has recently been changed, and the homework is now done
online through a course management system instead of from the textbook exercises. How can we see
if there has been improvement in student performance? Past experience suggests that final exam
scores under the old system were normally distributed with mean 65 and standard
deviation 13. It is believed that the distribution is still normal with standard deviation 13, but the
mean has potentially changed. A random sample of 40 students has a mean final exam score of 70.7.
Let's calculate a confidence interval for the new population mean using a confidence level of 90%.
From Table 8.1, the z critical value is $z_{\alpha/2} = z_{.05} = 1.645$. The desired interval is then

$$70.7 \pm 1.645 \cdot \frac{13}{\sqrt{40}} = 70.7 \pm 3.4 = (67.3, 74.1)$$

With 90% confidence, we can say that $67.3 < \mu < 74.1$, i.e., the true mean final exam score of all
students using the new homework system will be between 67.3 and 74.1. In particular, at a confidence
level of 90%, 65 is not a plausible value of $\mu$. Thus we can be confident that the population mean has
improved over the previous value of 65. ■
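The critical values in Table 8.1 and the intervals in Examples 8.2 and 8.3 can be reproduced with a few lines of code. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper names `z_critical` and `z_interval` are hypothetical, introduced only for this illustration.

```python
import math
from statistics import NormalDist


def z_critical(conf_level):
    """Return z_{alpha/2}, the value capturing upper-tail area alpha/2."""
    alpha = 1.0 - conf_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def z_interval(xbar, sigma, n, conf_level=0.95):
    """Return (lower, upper) limits of the CI xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    half_width = z_critical(conf_level) * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Critical values for the confidence levels in Table 8.1.
for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))

# Example 8.2: xbar = 1064, sigma = 55, n = 31, 95% confidence.
print(z_interval(1064, 55, 31, 0.95))

# Example 8.3: xbar = 70.7, sigma = 13, n = 40, 90% confidence.
print(z_interval(70.7, 13, 40, 0.90))
```

Because the code uses unrounded critical values, the endpoints may differ from hand calculations in the later decimal places; rounded to one decimal they agree with the intervals given in the examples.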


Confidence Level, Precision, and Choice of Sample Size 

Why settle for a confidence level of 95% when a level of 99% is achievable? Because the price paid 
for the higher confidence level is a wider interval. The 95% interval extends $1.96 \cdot \sigma/\sqrt{n}$ to each side
of $\bar{x}$, so the width of the interval is $2(1.96) \cdot \sigma/\sqrt{n} = 3.92 \cdot \sigma/\sqrt{n}$. Similarly, the width of the 99%
interval is $2(2.576) \cdot \sigma/\sqrt{n} = 5.152 \cdot \sigma/\sqrt{n}$. That is, we have more confidence in the 99% interval
precisely because it is wider. The higher the desired degree of confidence, the wider the resulting
interval. In fact, the only 100% CI for $\mu$ is $(-\infty, \infty)$, which is not terribly informative because, even
before sampling, we knew that this interval covers $\mu$.

If we think of the width of the interval as specifying its precision (with narrower intervals being 
more precise), then the confidence level (or reliability) of the interval is inversely related to its 
precision. A highly reliable interval estimate may be imprecise in that the endpoints of the interval 
may be far apart, whereas a precise interval may possess relatively low reliability. Thus it cannot be 
said unequivocally that a 99% interval is to be preferred to a 95% interval; the gain in reliability 
entails a loss in precision. 

An appealing strategy is to specify both the desired confidence level and interval width and then 
determine the necessary sample size. 


Example 8.4 Extensive monitoring of a certain operating system has suggested that response time to
a particular editing command is normally distributed with standard deviation 25 ms. A new operating
system has been installed, and an estimate of the true average response time $\mu$ for the new
environment is desired. Assuming that response times are still normally distributed with $\sigma = 25$, what
sample size is necessary to ensure that the resulting 95% CI has a width of (at most) 10? The sample
size $n$ must satisfy

$$10 = 2 \cdot (1.96) \cdot (25/\sqrt{n})$$

Re-arranging this equation gives

$$\sqrt{n} = 2 \cdot (1.96) \cdot (25)/10 = 9.80$$

so

$$n = 9.80^2 = 96.04$$

Since $n$ must be an integer, a sample size of 97 is required. ■


The general formula for the sample size $n$ necessary to ensure an interval width $w$ is obtained from
$w = 2 \cdot z_{\alpha/2} \cdot \sigma/\sqrt{n}$ as

$$n = \left(2 z_{\alpha/2} \cdot \frac{\sigma}{w}\right)^2 \qquad (8.6)$$

The smaller the desired width $w$, the larger $n$ must be. In addition, $n$ is an increasing function of
$\sigma$ (more population variability necessitates a larger sample size) and also of the confidence level
$100(1 - \alpha)\%$ (as $\alpha$ decreases, $z_{\alpha/2}$ increases).

The half-width $1.96 \cdot \sigma/\sqrt{n}$ of the 95% CI is sometimes called the margin of error associated
with a 95% confidence level; that is, with 95% confidence, the point estimate $\bar{x}$ will be no farther than
this from $\mu$. Before obtaining data, an investigator may wish to determine a sample size for which a
particular value of the margin of error is achieved. For example, with $\mu$ representing the average fuel
efficiency (mpg) for all cars of a certain type, the objective of an investigation may be to estimate $\mu$ to
within 1 mpg with 95% confidence. More generally, if we wish to estimate $\mu$ to within an amount
$b$ (the specified bound on the margin of error) with $100(1 - \alpha)\%$ confidence, the necessary sample
size results from replacing $w/2$ by $b$ in (8.6).
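Formula (8.6) and its margin-of-error variant translate directly into code. The following is a minimal sketch; the function names `sample_size_for_width` and `sample_size_for_margin` and the optional `z` argument (which lets a rounded table value such as 1.96 be supplied) are assumptions made for this illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_width(sigma, width, conf_level=0.95, z=None):
    """Smallest integer n for which the z interval has width at most `width`.

    Implements n = (2 * z_{alpha/2} * sigma / w)^2 from (8.6), rounded up.
    If z is given, it is used in place of the computed critical value.
    """
    if z is None:
        z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return math.ceil((2.0 * z * sigma / width) ** 2)


def sample_size_for_margin(sigma, bound, conf_level=0.95, z=None):
    """Smallest integer n for which the margin of error is at most `bound`.

    Equivalent to replacing w/2 by b in (8.6), i.e., using width w = 2b.
    """
    return sample_size_for_width(sigma, 2.0 * bound, conf_level, z)


# Example 8.4: sigma = 25 ms, desired width 10, 95% confidence, z = 1.96.
print(sample_size_for_width(sigma=25, width=10, z=1.96))
```

For Example 8.4 this returns 97, matching the hand calculation. Rounding up (rather than to the nearest integer) is what guarantees the width does not exceed the target.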


Deriving a General Confidence Interval 
Let $X_1, X_2, \ldots, X_n$ denote the sample on which the CI for a parameter $\theta$ is to be based. The general
strategy for deriving a CI relies on finding what's known as a pivotal quantity.


DEFINITION Suppose a random variable satisfying the following two properties can be found:


1. The variable is a function of both $X_1, \ldots, X_n$ and $\theta$.
2. The probability distribution of the variable does not depend on $\theta$ or on
any other unknown parameters.


Such a random variable is called a pivotal quantity.


For example, if the population distribution is normal with $\sigma$ known and $\theta = \mu$ unknown, the variable
$Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ in (8.1) satisfies both properties: (1) $Z$ clearly depends functionally on the $X_i$'s
and $\mu$, yet (2) $Z$ has a $N(0, 1)$ distribution, which does not depend on $\mu$. Hence $Z$ is a pivotal quantity.
In general, the form of a pivotal quantity is usually suggested by examining the distribution of an
appropriate estimator $\hat{\theta}$.

Let $h(X_1, \ldots, X_n; \theta)$ denote a general pivotal quantity. For any $\alpha$ between 0 and 1, constants $a$ and
$b$ can be found to satisfy

$$P(a < h(X_1, \ldots, X_n; \theta) < b) = 1 - \alpha \qquad (8.7)$$

Critically, because of the second property of a pivotal quantity, $a$ and $b$ do not depend on $\theta$. In the
normal example, $a = -z_{\alpha/2}$ and $b = z_{\alpha/2}$. Now suppose the inequalities in (8.7) can be manipulated to
isolate $\theta$ (typically, replace < by = and solve for $\theta$), giving the equivalent statement

$$P(l(X_1, \ldots, X_n) < \theta < u(X_1, \ldots, X_n)) = 1 - \alpha$$

Then $l(x_1, \ldots, x_n)$ and $u(x_1, \ldots, x_n)$ are the lower and upper confidence limits, respectively, for a
$100(1 - \alpha)\%$ CI. In the normal example, we saw that $l(X_1, \ldots, X_n) = \bar{X} - z_{\alpha/2} \cdot \sigma/\sqrt{n}$ and
$u(X_1, \ldots, X_n) = \bar{X} + z_{\alpha/2} \cdot \sigma/\sqrt{n}$.


Example 8.5 A theoretical model suggests that the time-to-breakdown of an insulating fluid between
electrodes at a particular voltage has an exponential distribution with unknown parameter $\lambda$ (see
Section 4.4). A random sample of $n = 10$ breakdown times yields the following sample data (in min):
$x_1 = 41.53$, $x_2 = 18.73$, $x_3 = 2.99$, $x_4 = 30.34$, $x_5 = 12.33$, $x_6 = 117.52$, $x_7 = 73.02$, $x_8 = 223.63$,
$x_9 = 4.00$, $x_{10} = 26.78$. A 95% CI is desired both for $\lambda$ and for the true average breakdown time.

Let $h(X_1, X_2, \ldots, X_n, \lambda) = 2\lambda \sum X_i$. Using a moment generating function argument, it can be shown
that this random variable has a chi-squared distribution (see Section 6.3) with $2n$ degrees of freedom.
Since $h$ is a function of both the $X_i$'s and $\lambda$, yet its distribution $\chi^2_{2n}$ does not depend on $\lambda$, it is a pivotal
quantity.

Appendix Table A.5 pictures a typical chi-squared density curve and tabulates critical values that
capture specified tail areas. The $\nu = 2n = 2(10) = 20$ row of the table shows that the .025 and .975
quantiles are 9.591 and 34.170, respectively. Thus for $n = 10$,

$P(9.591 < 2\lambda \sum X_i < 34.170) = .95$

Division by $2\sum X_i$ isolates $\lambda$, yielding

$P(9.591/(2\sum X_i) < \lambda < 34.170/(2\sum X_i)) = .95$

The lower limit of the 95% CI for $\lambda$ is $l = 9.591/(2\sum x_i)$, and the upper limit is $u = 34.170/(2\sum x_i)$. For
the given data, $\sum x_i = 550.87$, giving the interval (.00871, .03101). Based on the data, we are 95%
confident that the true value of the parameter $\lambda$ is between .00871 and .03101.

The mean of an exponential rv is $\mu = 1/\lambda$. Since

$P(2\sum X_i/34.170 < 1/\lambda < 2\sum X_i/9.591) = .95$

the 95% CI for true average breakdown time is $(2\sum x_i/34.170,\ 2\sum x_i/9.591) = (32.24, 114.87)$. With
95% confidence, true mean breakdown time under these experimental conditions is between 32.24
and 114.87 min. This interval is obviously quite wide, reflecting substantial variability in breakdown
times and a small sample size. Notice also that the two endpoints are not equidistant from the point
estimate; unlike in the normal case, here the CI for $\mu$ is not of the form $\bar{x} \pm c$. ■
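The arithmetic of Example 8.5 can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the two chi-squared quantiles are copied from the example (Table A.5) rather than computed, since the Python standard library has no chi-squared quantile function.

```python
# Chi-squared pivot for an exponential rate: 2 * lambda * sum(X_i) ~ chi-squared(2n).
breakdown_times = [41.53, 18.73, 2.99, 30.34, 12.33,
                   117.52, 73.02, 223.63, 4.00, 26.78]  # minutes, from Example 8.5

# .025 and .975 quantiles of the chi-squared distribution with 2n = 20 df,
# copied from the example (assumed values, not computed here).
CHI2_LOWER, CHI2_UPPER = 9.591, 34.170


def exponential_rate_ci(data, chi2_lower, chi2_upper):
    """Return (lower, upper) confidence limits for lambda."""
    twice_sum = 2 * sum(data)
    return chi2_lower / twice_sum, chi2_upper / twice_sum


def exponential_mean_ci(data, chi2_lower, chi2_upper):
    """Return (lower, upper) limits for mu = 1/lambda by inverting the rate limits."""
    rate_lower, rate_upper = exponential_rate_ci(data, chi2_lower, chi2_upper)
    return 1 / rate_upper, 1 / rate_lower


rate_ci = exponential_rate_ci(breakdown_times, CHI2_LOWER, CHI2_UPPER)
mean_ci = exponential_mean_ci(breakdown_times, CHI2_LOWER, CHI2_UPPER)
print("CI for lambda:", tuple(round(v, 5) for v in rate_ci))
print("CI for mean:  ", tuple(round(v, 2) for v in mean_ci))
```

Note that the limits for the mean are the reciprocals of the limits for $\lambda$ in reverse order, which is why the interval for $\mu$ is not symmetric about the point estimate.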


A General Large-Sample Confidence Interval 
Let $X_1, X_2, \ldots, X_n$ be a random sample from any population having a mean $\mu$ and standard deviation
$\sigma$. Provided that $n$ is large, the Central Limit Theorem (CLT) implies that $\bar{X}$ has approximately a
normal distribution whatever the nature of the population distribution. It then follows that $Z =
(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has approximately a standard normal distribution, so that

$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$

An argument parallel with that given earlier in this section yields $\bar{x} \pm z_{\alpha/2} \cdot \sigma/\sqrt{n}$ as a large-sample CI
for $\mu$ with a confidence level of approximately $100(1 - \alpha)\%$. That is, when $n$ is large, the CI (8.5) for
$\mu$ remains valid whatever the population distribution (provided that the qualifier "approximately" is
inserted in front of the confidence level).
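A small simulation can illustrate what "approximately" means here. The sketch below is a hypothetical setup, not taken from the text: it draws repeated samples from an exponential population with rate 1 (so $\mu = \sigma = 1$, a clearly nonnormal case) and records how often $\bar{x} \pm z_{\alpha/2} \cdot \sigma/\sqrt{n}$ captures $\mu$. The sample size, number of replications, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist


def coverage_of_z_interval(n=50, reps=2000, conf=0.95, seed=1):
    """Estimate how often xbar +/- z*sigma/sqrt(n) captures mu for Exp(1) data."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and sd of an exponential distribution with rate 1
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps


estimated_coverage = coverage_of_z_interval()
print(f"estimated coverage: {estimated_coverage:.3f}")
```

The estimated proportion should land near the nominal .95, though not exactly on it, both because of simulation error and because the normal approximation is only approximate for finite $n$.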

The foregoing example is a special case of a general large-sample CI for a parameter $\theta$. Suppose
that $\hat{\theta}$ is an estimator satisfying the following properties:

1. $\hat{\theta}$ has approximately a normal distribution;

2. $\hat{\theta}$ is (at least approximately) unbiased for $\theta$; and

3. an expression for $\sigma_{\hat{\theta}}$, the standard deviation of $\hat{\theta}$, is available.

For example, in the above discussion $\theta = \mu$, $\hat{\mu} = \bar{X}$ is an unbiased estimator whose distribution is
approximately normal when $n$ is large, and $\sigma_{\hat{\theta}} = \sigma_{\bar{X}} = \sigma/\sqrt{n}$. In Section 7.4, we saw that under very
general conditions a maximum likelihood estimator $\hat{\theta}$ satisfies the first two properties when $n$ is large,
so what follows can be applied to many mles.


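Standardizing $\hat{\theta}$ yields the rv $Z = (\hat{\theta} - \theta)/\sigma_{\hat{\theta}}$, which has approximately a standard normal
distribution, making $Z$ an approximate pivotal quantity. This justifies the probability statement

$P\left(-z_{\alpha/2} < \frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}} < z_{\alpha/2}\right) \approx 1 - \alpha$  (8.8)

from which a $100(1 - \alpha)\%$ CI for $\theta$ may potentially be obtained. How we proceed then depends on the
formula for $\sigma_{\hat{\theta}}$.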
Suppose first that $\sigma_{\hat{\theta}}$ does not involve any unknown parameters. Then replacing each < by = in
(8.8) and solving for $\theta$ results in confidence limits $\hat{\theta} \pm z_{\alpha/2} \cdot \sigma_{\hat{\theta}}$ for $\theta$.

Next, suppose that $\sigma_{\hat{\theta}}$ doesn't involve $\theta$ itself but does involve at least one other unknown
parameter. Let $s_{\hat{\theta}}$ be the estimate of $\sigma_{\hat{\theta}}$ obtained by using estimates in place of the unknown
parameters, e.g., $s/\sqrt{n}$ estimates $\sigma/\sqrt{n}$. Under general conditions (essentially that $s_{\hat{\theta}}$ be close to $\sigma_{\hat{\theta}}$
for most samples), a valid CI for $\theta$ is then $\hat{\theta} \pm z_{\alpha/2} \cdot s_{\hat{\theta}}$. The interval $\bar{x} \pm z_{\alpha/2} \cdot s/\sqrt{n}$ is an example; we
will encounter this interval in the next section.

Finally, suppose that $\sigma_{\hat{\theta}}$ involves the unknown $\theta$ itself. This is the case, for example, when $\theta = p$, a
population proportion, as we'll see in Section 8.3. Then $(\hat{\theta} - \theta)/\sigma_{\hat{\theta}} = z_{\alpha/2}$ can be difficult to solve.
An approximate solution can often be obtained by replacing $\theta$ in $\sigma_{\hat{\theta}}$ by its estimate $\hat{\theta}$. This results in
an estimated standard deviation $s_{\hat{\theta}}$, and the corresponding interval is again $\hat{\theta} \pm z_{\alpha/2} \cdot s_{\hat{\theta}}$.


Example 8.6 A shipping company offers a flat fee for packages weighing up to 1 lb. Let $X_1, \ldots, X_n$
represent the weights (lb) of a random sample of packages ready for shipment, modeled by the pdf
$f(x; \theta) = \theta x^{\theta - 1}$ for $0 < x < 1$. The goal is to obtain a CI for the parameter $\theta$.

From Example 7.37, the maximum likelihood estimator of $\theta$ is $\hat{\theta} = -n/\sum \ln(X_i)$, which for large
$n$ has approximately a normal distribution with mean $\theta$ and standard deviation $\theta/\sqrt{n}$. The preceding
discussion then suggests a CI of the form $\hat{\theta} \pm z_{\alpha/2}\,\theta/\sqrt{n}$, but this is impractical because the standard error is
a function of the unknown parameter $\theta$ itself. One solution is to solve the system of inequalities

$-z_{\alpha/2} < \frac{\hat{\theta} - \theta}{\theta/\sqrt{n}} < z_{\alpha/2}$

suggested by (8.8) for $\theta$. Alternatively, we could substitute $\hat{\theta}$ for $\theta$ in the standard deviation formula,
resulting in a CI with endpoints $\hat{\theta} \pm z_{\alpha/2}\,\hat{\theta}/\sqrt{n}$. That is, once the data is obtained and the value
$\hat{\theta} = -n/\sum \ln(x_i)$ is calculated, that value of $\hat{\theta}$ is used twice to compute the CI. ■
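Both approaches in Example 8.6 are easy to compare numerically. The following sketch uses simulated package weights, since the example supplies no data; the true value `TRUE_THETA = 3.0`, the sample size, the seed, and all function names are assumptions made purely for illustration. Solving the inequalities for $\theta$ gives $\hat{\theta}/(1 + z_{\alpha/2}/\sqrt{n}) < \theta < \hat{\theta}/(1 - z_{\alpha/2}/\sqrt{n})$, which requires $n > z_{\alpha/2}^2$.

```python
import math
import random
from statistics import NormalDist


def theta_mle(weights):
    """Maximum likelihood estimate of theta for f(x; theta) = theta * x**(theta - 1)."""
    return -len(weights) / sum(math.log(x) for x in weights)


def theta_ci_solved(weights, conf=0.95):
    """CI obtained by solving -z < (theta_hat - theta)/(theta/sqrt(n)) < z for theta."""
    n = len(weights)
    ratio = NormalDist().inv_cdf(1 - (1 - conf) / 2) / math.sqrt(n)
    if ratio >= 1:
        raise ValueError("n must exceed z**2 for the solved interval to exist")
    theta_hat = theta_mle(weights)
    return theta_hat / (1 + ratio), theta_hat / (1 - ratio)


def theta_ci_plugin(weights, conf=0.95):
    """CI obtained by substituting theta_hat for theta in the standard deviation."""
    n = len(weights)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    theta_hat = theta_mle(weights)
    margin = z * theta_hat / math.sqrt(n)
    return theta_hat - margin, theta_hat + margin


# Simulated weights (hypothetical data): if U is uniform on (0, 1],
# then U**(1/theta) has cdf x**theta on (0, 1], matching the stated pdf.
rng = random.Random(0)
TRUE_THETA = 3.0
weights = [(1 - rng.random()) ** (1 / TRUE_THETA) for _ in range(200)]

print("theta_hat:       ", round(theta_mle(weights), 3))
print("solved interval: ", tuple(round(v, 3) for v in theta_ci_solved(weights)))
print("plug-in interval:", tuple(round(v, 3) for v in theta_ci_plugin(weights)))
```

The plug-in interval is symmetric about $\hat{\theta}$, whereas the solved interval is shifted to the right; for large $n$ the two nearly coincide.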


One-Sided Confidence Intervals (Confidence Bounds) 
The confidence intervals discussed thus far give both a lower confidence bound and an upper 
confidence bound for the parameter being estimated. In some circumstances, an investigator will want 
only one of these two types of bounds. For example, a psychologist may wish to calculate a 95% 
upper confidence bound for true average reaction time to a particular stimulus, or a surgeon may want 
only a lower confidence bound for true average remission time after colon cancer surgery. 

In general, an upper confidence bound for a parameter $\theta$ with confidence level $100(1 - \alpha)\%$
based on a random sample $X_1, \ldots, X_n$ is a quantity $u(X_1, \ldots, X_n)$ such that

$P(\theta < u(X_1, \ldots, X_n)) = 1 - \alpha$

Similarly, a lower confidence bound $l(X_1, \ldots, X_n)$ satisfies $P(l(X_1, \ldots, X_n) < \theta) = 1 - \alpha$. As with
two-sided confidence intervals, such bounds are evaluated by substituting the observed values
$X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$. One-sided confidence bounds are often obtained by identifying a
pivotal quantity and manipulating an appropriate inequality statement to isolate the parameter $\theta$.


Example 8.7 Consider again the scenario of a random sample $X_1, \ldots, X_n$ from a normal distribution
for which $\sigma$ is known. Because the cumulative area under the standard normal curve to the left of
1.645 is .95,

$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.645\right) = .95$

Manipulating the inequality inside the parentheses to isolate $\mu$ on one side gives the inequality
$\bar{X} - 1.645\sigma/\sqrt{n} < \mu$; the expression on the left is a lower confidence bound for $\mu$. Applied to the data
from Example 8.1, we obtain a 95% lower confidence bound of $1064 - 1.645 \cdot 55/\sqrt{31} =
1047.75$ MPa for the true average tensile strength.

Starting with $P(-1.645 < Z) = .95$ and manipulating the inequality results in an upper confidence
bound. A similar argument gives a one-sided bound associated with any other confidence level. ■
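As a quick check of the bound in Example 8.7, the sketch below computes one-sided bounds for a normal mean with known $\sigma$. The function names are hypothetical; the inputs 1064, 55, and 31 are the values quoted in the example. The key point is that a one-sided bound uses $z_\alpha$ rather than $z_{\alpha/2}$.

```python
from math import sqrt
from statistics import NormalDist


def lower_confidence_bound(xbar, sigma, n, conf=0.95):
    """Lower confidence bound for mu when sigma is known: xbar - z_alpha*sigma/sqrt(n)."""
    z_alpha = NormalDist().inv_cdf(conf)  # all of alpha goes in one tail
    return xbar - z_alpha * sigma / sqrt(n)


def upper_confidence_bound(xbar, sigma, n, conf=0.95):
    """Upper confidence bound for mu when sigma is known: xbar + z_alpha*sigma/sqrt(n)."""
    z_alpha = NormalDist().inv_cdf(conf)
    return xbar + z_alpha * sigma / sqrt(n)


lower = lower_confidence_bound(xbar=1064, sigma=55, n=31)
upper = upper_confidence_bound(xbar=1064, sigma=55, n=31)
print(f"95% lower bound: {lower:.2f} MPa")
print(f"95% upper bound: {upper:.2f} MPa")
```

The lower and upper bounds are each valid at the 95% level on their own; taken together they would form only a 90% two-sided interval.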


Exercises: Section 8.1 (1-12) 


1. Consider a normal population distribution with the value of $\sigma$ known.

a. What is the confidence level for the interval $\bar{x} \pm 2.81\sigma/\sqrt{n}$?

b. What is the confidence level for the interval $\bar{x} \pm 1.44\sigma/\sqrt{n}$?

c. What value of $z_{\alpha/2}$ in the CI Formula (8.5) results in a confidence level of 99.7%?

d. Answer the question posed in part (c) for a confidence level of 75%.


2. Each of the following is a confidence interval computed from (8.5) for $\mu$ = true average
(i.e., population mean) resonance frequency (Hz) for all tennis rackets of a certain type:

(114.4, 115.6)   (114.1, 115.9)

a. What is the value of the sample mean resonance frequency?

b. Both intervals were calculated from the same sample data. The confidence level
for one of these intervals is 90% and for the other is 99%. Which of the intervals
has the 90% confidence level, and why?


3. Suppose that a random sample of 50 bottles of a particular brand of cough syrup is
selected and the alcohol content of each bottle is determined. Let $\mu$ denote the average
alcohol content for the population of all bottles of the brand under study. Suppose that
the resulting 95% confidence interval is (7.8, 9.4).

a. Would a 90% confidence interval calculated from this same sample have been
narrower or wider than the given interval? Explain your reasoning.

b. Consider the following statement: There is a 95% chance that $\mu$ is between 7.8 and
9.4. Is this statement correct? Why or why not?

c. Consider the following statement: We can be highly confident that 95% of all bottles
of this type of cough syrup have an alcohol content that is between 7.8 and
9.4. Is this statement correct? Why or why not?

d. Consider the following statement: If the process of selecting a sample of size 50
and then computing the corresponding 95% interval is repeated 100 times, 95 of
the resulting intervals will include $\mu$. Is this statement correct? Why or why not?


4. A CI is desired for the true average stray-load loss $\mu$ (watts) for a certain type of induction
motor when the line current is held at 10 amps for a speed of 1500 rpm. Assume that
stray-load loss is normally distributed with $\sigma = 3.0$.

a. Compute a 95% CI for $\mu$ when $n = 25$ and $\bar{x} = 58.3$.

b. Compute a 95% CI for $\mu$ when $n = 100$ and $\bar{x} = 58.3$.

c. Compute a 99% CI for $\mu$ when $n = 100$ and $\bar{x} = 58.3$.

d. Compute an 82% CI for $\mu$ when $n = 100$ and $\bar{x} = 58.3$.

e. How large must $n$ be if the width of the 99% interval for $\mu$ is to be 1.0?


5. Assume that the helium porosity (in percentage) of coal samples taken from any
particular seam is normally distributed with true standard deviation .75.

a. Compute a 95% CI for the true average porosity of a certain seam if the average
porosity for 20 specimens from the seam was 4.85.

b. Compute a 98% CI for true average porosity of another seam based on 16
specimens with a sample average porosity of 4.56.

c. How large a sample size is necessary if the width of the 95% interval is to be .40?

d. What sample size is necessary to estimate true average porosity to within .2 with
99% confidence?


6. On the basis of extensive tests, the yield point of a particular type of mild steel reinforcing
bar is known to be normally distributed with $\sigma = 100$. The composition of the bar has been
slightly modified, but the modification is not believed to have affected either the normality
or the value of $\sigma$.

a. Assuming this to be the case, if a sample of 25 modified bars resulted in a sample
average yield point of 8439 lb, compute a 90% CI for the true average yield point of
the modified bar.

b. How would you modify the interval in part (a) to obtain a confidence level of 92%?

7. By how much must the sample size $n$ be increased if the width of the CI (8.5) is to be
halved? If the sample size is increased by a factor of 25, what effect will this have on the
width of the interval? Justify your assertions.

8. Let $\alpha_1 > 0$, $\alpha_2 > 0$, with $\alpha_1 + \alpha_2 = \alpha$. Then

$P\left(-z_{\alpha_1} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha_2}\right) = 1 - \alpha$

a. Use this equation to derive a more general expression for a $100(1 - \alpha)\%$ CI for $\mu$ of
which the interval (8.5) is a special case.

b. Let $\alpha = .05$ and $\alpha_1 = \alpha/4$, $\alpha_2 = 3\alpha/4$. Does this result in a narrower or wider interval
than the interval (8.5)?

9. a. Generalize the method of Example 8.7 to obtain a lower bound for $\mu$ with a confidence
level of $100(1 - \alpha)\%$.

b. Use part (a) to calculate a 99.5% confidence lower bound for the data in Exercise 5a.

c. What is the analogous formula for a $100(1 - \alpha)\%$ confidence upper bound on
$\mu$? Compute this 99% upper bound for the data of Exercise 4a.

10. A random sample of $n = 15$ heat pumps of a certain type yielded the following observations
on lifetime (in years):

2.0 1.3 6.0 1.9 5.1 4 1.0
15.7 7 4.8 2o 12.2

a. Assume that the lifetime distribution is exponential and use an argument parallel
to that of Example 8.5 to obtain a 95% CI for expected (true average) lifetime.

b. How should the interval of part (a) be altered to achieve a confidence level of 99%?

c. What is a 95% CI for the standard deviation of the lifetime distribution?
[Hint: What is the standard deviation of an exponential random variable?]

11. Consider the next 1000 95% CIs for $\mu$ that a statistical consultant will obtain for various
clients. Suppose the data sets on which the intervals are based are selected independently
of one another. How many of these 1000 intervals do you expect to capture the corresponding
value of $\mu$? What is the probability that between 940 and 960 of these intervals contain the
corresponding value of $\mu$? [Hint: Let $Y$ = the number among the 1000 intervals that contain $\mu$.
What kind of random variable is $Y$?]

12. The superintendent of a large school district, having once had a course in probability
and statistics, believes that the number of teachers absent on any given day
has a Poisson distribution with parameter $\mu$. Use the accompanying data on absences for
50 days to derive a large-sample CI for $\mu$. [Hint: The mean and variance of a Poisson
variable both equal $\mu$, so

$Z = \frac{\bar{X} - \mu}{\sqrt{\mu/n}}$

has approximately a standard normal distribution and is thus a pivotal quantity.
Now proceed as in Example 8.6.]

Number of absences   0 1 2 38
Frequency            1
8.2 The One-Sample t Interval and Its Relatives


The CI for $\mu$ given in the previous section assumed that the population distribution is normal with the
value of $\sigma$ known. The derivation of the interval relied on the pivotal quantity $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$
in (8.1), which has a standard normal distribution under these assumptions. In this section, we will
construct a CI for $\mu$ for the more realistic situation when $\sigma$ is unknown; this is the interval estimate
used in practice.

Consider the variable obtained by replacing $\sigma$ in $Z$ by the sample standard deviation $S$. Define a
new random variable $T$ by

$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$  (8.9)

It is important to contrast the behavior of $Z$ in repeated sampling with that of $T$ (this is really a
refresher from Section 6.4). The only variability in $Z$ from one sample to another comes from
$\bar{X}$ in the numerator. However, there are two sources of sample-to-sample variability
in $T$: both $\bar{X}$ in the numerator and $S$ in the denominator. Because of this extra variation in $T$, it stands
to reason that the distribution of $T$ should be more spread out than that of $Z$. That is, the density curve
for $T$ should be more spread out than the standard normal curve.


The One-Sample t Confidence Interval 

Suppose that $X_1, \ldots, X_n$ is a random sample from a normal population distribution. Then Gosset's
Theorem from Section 6.4 states that the rv $T$ in (8.9) follows a t distribution with $n - 1$ degrees of
freedom (df). Properties of the t family of distributions were detailed in Section 6.3; for now, it
suffices to recall that the t distribution with $\nu$ df has a symmetric, bell-shaped density curve centered at
0 that is wider than a standard normal curve but converges to the standard normal curve as $\nu \to \infty$ (so
the z curve may be thought of as the t curve with df = $\infty$). See Figure 6.16 for an illustration. Recall
also the notation for values that capture particular upper-tail t curve areas.


NOTATION  Let $t_{\alpha,\nu}$ = the number on the measurement axis for which the area under the t curve
with $\nu$ df to the right of $t_{\alpha,\nu}$ is $\alpha$; $t_{\alpha,\nu}$ is called a t critical value.


This notation is illustrated in Figure 8.5. Appendix Table A.6 gives $t_{\alpha,\nu}$ for selected values of $\alpha$ and $\nu$.
The columns of the table correspond to different values of $\alpha$. To obtain $t_{.05,15}$, go to the $\alpha = .05$
column, look down to the $\nu = 15$ row, and read $t_{.05,15} = 1.753$. Similarly, $t_{.05,22} = 1.717$ (.05 column,
$\nu = 22$ row), and $t_{.01,22} = 2.508$. Statistical software packages can provide t critical values for any
specified tail area and df; for example, $t_{\alpha,\nu}$ can be obtained in R with the command qt(1-α, ν).


Figure 8.5 A pictorial definition of $t_{\alpha,\nu}$ (shaded area = $\alpha$ under the $t_\nu$ curve, to the right of $t_{\alpha,\nu}$)
The values of $t_{\alpha,\nu}$ exhibit regular behavior as we move across a row or down a column. For fixed $\nu$,
$t_{\alpha,\nu}$ increases as $\alpha$ decreases, since we must move farther to the right of zero to capture area $\alpha$ in the
tail. For fixed $\alpha$, as $\nu$ is increased (i.e., as we look down any particular column of the t table) the value
of $t_{\alpha,\nu}$ decreases. This is because a larger value of $\nu$ implies a t distribution with smaller spread, so it is
not necessary to go so far from zero to capture tail area $\alpha$. Furthermore, $t_{\alpha,\nu}$ decreases more slowly as $\nu$
increases. Consequently, the table values are shown in increments of 2 between 30 and 40 df and then
jump to $\nu = 50$, 60, 120, and finally $\infty$. Because $t_\infty$ is the standard normal curve, the familiar $z_\alpha$
values appear in the last row of the table.

Now let's obtain the desired confidence interval. The pivotal quantity $T$ in (8.9) has a $t_{n-1}$
distribution, and the area under the corresponding t density curve between $-t_{\alpha/2,n-1}$ and $t_{\alpha/2,n-1}$ is
$1 - \alpha$ (area $\alpha/2$ lies in each tail), so

$P(-t_{\alpha/2,n-1} < T < t_{\alpha/2,n-1}) = 1 - \alpha$  (8.10)

Expression (8.10) differs from similar expressions in Section 8.1 in that $T$ and $t_{\alpha/2,n-1}$ are used in
place of $Z$ and $z_{\alpha/2}$, but it can be manipulated in the same manner to obtain a confidence interval for $\mu$.


PROPOSITION  Let $\bar{x}$ and $s$ be the sample mean and sample standard deviation computed from
the results of a random sample from a normal population with mean $\mu$. Then a
$100(1 - \alpha)\%$ confidence interval for $\mu$, also called the one-sample t CI, is

$\left(\bar{x} - t_{\alpha/2,n-1} \cdot \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,n-1} \cdot \frac{s}{\sqrt{n}}\right)$  (8.11)

or, more compactly, $\bar{x} \pm t_{\alpha/2,n-1} \cdot s/\sqrt{n}$.

An upper confidence bound for $\mu$ is

$\bar{x} + t_{\alpha,n-1} \cdot \frac{s}{\sqrt{n}}$

and replacing + by − in this latter expression gives a lower confidence bound
for $\mu$; both have confidence level $100(1 - \alpha)\%$.


Example 8.8 Have you ever dreamed of owning a Porsche? Even though academic salaries leave 
little room for luxuries, the authors thought maybe the purchase of a used Boxster, the least expensive 
Porsche model, might be feasible. So on July 15, 2019 we went to www.cars.com to peruse prices. 
The news was discouraging, so we instead selected a random sample of 16 such vehicles and obtained 
the following odometer readings (miles): 


80,000 30,100 97,500 58,551 73,787 51,800 69,267 44,530
42,192 104,920 41,442 27,418 43,436 77,219 5,991 14,362


Figure 8.6 shows a normal probability plot of the data; this version includes a superimposed line
which makes it easier to judge whether the pattern in the plot is reasonably linear. Very clearly that is
the case. It is therefore quite plausible that the distribution of odometer readings is (at least
approximately) normal, which validates the use of the one-sample t confidence interval to estimate the
population mean odometer reading, $\mu$.

Figure 8.6 Normal probability plot of the Boxster odometer reading data
The sample mean and standard deviation are 53,907.2 and 28,287.2, respectively, and the (estimated)
standard error of the mean is $s/\sqrt{n} = 7071.8$. Table A.6 shows that the t critical value for a
confidence level of 95% when df = 16 − 1 = 15 is $t_{.025,15} = 2.131$. The confidence interval is then

$\bar{x} \pm t_{\alpha/2,n-1} \cdot \frac{s}{\sqrt{n}} = 53,907.2 \pm (2.131)(7071.8) = 53,907.2 \pm 15,070.0 = (38,837.2,\ 68,977.2)$

That is, we can say with a confidence level of 95% that $38,837.2 < \mu < 68,977.2$. This CI is quite
wide, indicating that our knowledge of $\mu$ is imprecise.

Remember that it is not correct at this point to write $P(38,837.2 < \mu < 68,977.2) = .95$, because
nothing inside the parentheses is random. The interval we have calculated may or may not include the
actual value of $\mu$. If we were to obtain sample after sample of size 16 from this population and for
each one use (8.11) with $t = 2.131$, in the long run 95% of the calculated CIs would include $\mu$
whereas 5% would not. Without knowing the value of $\mu$, we can't know whether the particular
interval we have calculated is one of the "good" 95% or the "bad" 5%. ■
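The calculation in Example 8.8 can be reproduced directly from the odometer readings. The sketch below is illustrative: the function name is hypothetical, and the t critical value 2.131 is copied from the example (Table A.6) because the Python standard library does not provide t quantiles.

```python
from math import sqrt
from statistics import mean, stdev

# Odometer readings (miles) for the 16 sampled vehicles in Example 8.8
odometer = [80000, 30100, 97500, 58551, 73787, 51800, 69267, 44530,
            42192, 104920, 41442, 27418, 43436, 77219, 5991, 14362]

T_CRIT = 2.131  # t_{.025,15}, copied from the example rather than computed


def one_sample_t_ci(data, t_crit):
    """Return (lower, upper) limits of xbar +/- t_crit * s / sqrt(n)."""
    n = len(data)
    xbar = mean(data)
    std_error = stdev(data) / sqrt(n)  # stdev uses the n - 1 divisor
    return xbar - t_crit * std_error, xbar + t_crit * std_error


ci_lower, ci_upper = one_sample_t_ci(odometer, T_CRIT)
print(f"sample mean: {mean(odometer):.1f}")
print(f"sample sd:   {stdev(odometer):.1f}")
print(f"95% CI:      ({ci_lower:.1f}, {ci_upper:.1f})")
```

Changing `T_CRIT` to a critical value for a different confidence level (again taken from a table or from software) gives the corresponding interval for the same data.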


Gosset's Theorem and the resulting one-sample t CI (8.11) assume a normal population distribution,
which can be validated using a normal probability plot. Thankfully, the one-sample t CI for $\mu$
is robust to small or even moderate departures from normality unless $n$ is quite small. By "robust," we
mean that if a t critical value for 95% confidence is used in calculating the interval, the actual
confidence will be reasonably close to the nominal 95% level, and similarly for other confidence
levels. As a result, many practitioners use (8.11) if either the population distribution is plausibly
normal or the sample size is "large"; $n > 40$ is a popular criterion.

It's worth noting that if the sample size is large, whether we use a t or z critical value does
not make much practical difference. Thanks to the Central Limit Theorem, the random variable
$(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has an approximately standard normal distribution when $n$ is large; simultaneously,
$S$ is highly likely to be close to $\sigma$, suggesting that $T$ in (8.9) is also approximately normal. Thus, for
large $n$, one may apply the CI formula

$\left(\bar{x} - z_{\alpha/2} \cdot \frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\right)$  (8.12)

in lieu of (8.11). A large-sample upper confidence bound for $\mu$ results from replacing $z_{\alpha/2}$ with $z_\alpha$ in
the upper limit of the interval (8.12); an analogous lower bound is obtained from the same
replacement made to the lower limit of (8.12).


Example 8.9 A survey published by Gallup (Nov. 1, 2019) of 1526 adults asked how much each
person planned to "personally spend on Christmas gifts" in 2019. The mean response was $942 with a
standard deviation of $1116. Clearly the distribution of planned expenses is strongly positively
skewed (the standard deviation exceeds the mean); nevertheless, let's calculate an (approximate) 99%
CI for $\mu$, the mean amount all US adults planned to spend on Christmas presents in 2019.

Because $n = 1526$ is very large, either Expression (8.11) or (8.12) is appropriate, even though the
population distribution is nonnormal. The z and t critical values are $z_{.005} = 2.576$ and
$t_{.005,1525} = 2.579$, so the resulting CIs will be essentially identical. Using the latter, the resulting CI is

$942 \pm 2.579 \cdot \frac{1116}{\sqrt{1526}} = 942 \pm 73.7 = (868.3,\ 1015.7)$

At the 99% confidence level, we conclude that the average amount US adults planned to personally
spend on Christmas gifts in 2019 was between $868.30 and $1015.70. ■
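To see how little the choice of critical value matters at this sample size, the following sketch computes the margin of error in Example 8.9 both ways. The summary statistics and the value $t_{.005,1525} = 2.579$ are taken from the example; the variable names are illustrative, and the z critical value is computed with the standard library.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 1526, 942.0, 1116.0      # summary statistics from Example 8.9
std_error = s / sqrt(n)

z_crit = NormalDist().inv_cdf(0.995)  # z_{.005}, roughly 2.576
T_CRIT = 2.579                        # t_{.005,1525}, copied from the example

z_margin = z_crit * std_error
t_margin = T_CRIT * std_error

print(f"z-based 99% CI: ({xbar - z_margin:.1f}, {xbar + z_margin:.1f})")
print(f"t-based 99% CI: ({xbar - t_margin:.1f}, {xbar + t_margin:.1f})")
print(f"difference in margins: {t_margin - z_margin:.2f}")
```

The two margins differ by only about a dime, which is negligible next to a margin of more than $70.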


Sample Size Determination 
In Section 8.1, we considered the problem of determining the sample size required to achieve a
certain level of precision at a prescribed confidence level. Under the assumptions of that section, we
derived the formula

$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{w/2}\right)^2$

for the minimum sample size necessary to place an upper bound $w$ on the width of the interval. Given
the discussion in this section, it might seem like the natural update to this formula is

$n = \left(\frac{t_{\alpha/2,n-1} \cdot s}{w/2}\right)^2$  (8.13)

where $s$ is the sample standard deviation. However, this formula presents two practical problems.


First, sample size determination typically occurs before a study is carried out, in which case the 
researcher doesn’t yet have a value for s. This can be addressed by using a sample standard deviation 
from a previous, similar study, though that assumes variability has not changed significantly. 
Another, more conservative method is to use range/4 as a crude estimate of the standard deviation; 
this is “conservative” in the sense that for many distributions range/4 exceeds s, so use of this estimate 
in (8.13) typically returns a somewhat larger n than needed. The range/4 formula has the advantage 
that even in the absence of reliable data from which to calculate s, the range of potential values from a 
particular process is typically easier to “guess.” 
Second, $n$ now appears on both sides of the equation: we need to know $n$ before finding the
t critical value, which then determines the sample size $n$ on the left-hand side of (8.13). Technology
can solve this problem: many statistical software packages will search iteratively to find the smallest
$n$ which satisfies (8.13). Alternatively, you can simply replace $t_{\alpha/2,n-1}$ with $z_{\alpha/2}$ in (8.13) and solve for
$n$, but the result will be slightly smaller than the sample size actually required. Notice, though, that
even software still requires a value (or an estimate) for $s$.


Example 8.10 A supplier of carbon-ceramic brake disks for high-performance cars has recently
redesigned its manufacturing process. The company needs to estimate, among other things, the true
mean density $\mu$ of their new ceramic material. Data from the previous process suggests that the
standard deviation for ceramic density is around 0.2 g/cm³. Assuming this value still approximately
holds for the new process, how large a sample will the company require to obtain a 99% CI for $\mu$ no
wider than 0.1 g/cm³? Apply (8.13) with $s = 0.2$, $w = 0.1$, and $t_{.005,n-1} \approx z_{.005} = 2.576$:

$n \approx \left(\frac{2.576(0.2)}{0.1/2}\right)^2 = 106.17$

Since $n$ must be an integer, a minimum of 107 ceramic specimens is required. As noted previously,
this will be a slight underestimate, because we used 2.576 in place of the unknown t critical value.
Statistical software, which does not use this approximation, indicates that at least 110 specimens are
required. ■
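The iterative search mentioned above can be sketched in a few lines. The code below is one possible approach, not the algorithm used by any particular statistics package: it obtains t critical values by numerically integrating the t density (Simpson's rule) and bisecting, then starts from the z-based answer (a lower bound, since t critical values exceed z critical values) and increases $n$ until (8.13) is satisfied. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_upper_tail(x, df, steps=1000):
    """P(T > x) for x >= 0 and T ~ t(df), using Simpson's rule on [0, x]."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


def t_critical(alpha, df):
    """Bisection for t_{alpha,df}; the bracket [0, 50] assumes df is not tiny."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def n_from_z(s, w, conf):
    """Sample size from (8.13) with z_{alpha/2} in place of the t critical value."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * s / (w / 2)) ** 2)


def n_from_t(s, w, conf):
    """Smallest n at or above the z-based answer that satisfies (8.13)."""
    alpha = 1 - conf
    n = max(n_from_z(s, w, conf), 2)
    while n < (t_critical(alpha / 2, n - 1) * s / (w / 2)) ** 2:
        n += 1
    return n


print("z-based n:", n_from_z(s=0.2, w=0.1, conf=0.99))
print("t-based n:", n_from_t(s=0.2, w=0.1, conf=0.99))
```

With the inputs of Example 8.10, the z-based shortcut and the iterative search should reproduce the two sample sizes quoted there.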


A Prediction Interval for a Single Future Value 
In many applications, an investigator wishes to predict a single value of a variable to be observed at 
some future time, rather than to estimate the mean value of that variable. 


Example 8.11 Scientists worldwide routinely monitor the general health of forests, and engineers
investigate mechanical properties of various wood types. Consider the following core wood density
measurements (g/mm³) from a sample of 25 canopy trees in western Thailand ("Radial Variation of
Wood Functional Traits Reflect Size-Related Adaptations of Tree Mechanics and Hydraulics,"
Functional Ecology 2017):

391.2 431.0 447.1 375.3 470.7 543.7 592.7 546.7 601.8 598.8
492.3 454.4 548.7 494.9 585.6 647.8 639.2 700.4 640.1 620.5
755.2 668.7 644.6 717.7 663.0

Figure 8.7 shows a normal probability plot from R software. The straightness of the pattern
provides support for assuming that core wood density measurements in this population are at least
approximately normal.

Figure 8.7 Normal probability plot for the wood density data of Example 8.11

The sample mean and standard deviation are $\bar{x} = 570.9$ g/mm³ and $s = 103.9$ g/mm³, respectively.
A 95% CI for $\mu$ = the population mean core wood density is

$\bar{x} \pm t_{.025,24} \cdot \frac{s}{\sqrt{n}} = 570.9 \pm 2.064 \cdot \frac{103.9}{\sqrt{25}} = 570.9 \pm 42.9 = (528.0,\ 613.8)$

This is fine if researchers are interested in average properties of these trees. But what about a single
tree from this forest: what should we predict for its core wood density? A point prediction, analogous
to a point estimate, is just $\bar{x} = 570.9$ g/mm³. But this prediction unfortunately gives no
information about reliability or precision. A different type of interval is required to make inferences
about the density of an individual wood specimen. ■


The general scenario is as follows: We have available a random sample $X_1, X_2, \ldots, X_n$ from a
normal population distribution, and we wish to predict the value of $X_{n+1}$, a single future observation.
A point predictor is $\bar{X}$, and the resulting prediction error is $\bar{X} - X_{n+1}$. The expected value of the
prediction error is

$E(\bar{X} - X_{n+1}) = E(\bar{X}) - E(X_{n+1}) = \mu - \mu = 0$

Since $X_{n+1}$ is independent of $X_1, \ldots, X_n$, it is independent of $\bar{X}$, so the variance of the prediction error is

$V(\bar{X} - X_{n+1}) = V(\bar{X}) + V(X_{n+1}) = \frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1 + \frac{1}{n}\right)$

The prediction error is a linear combination of independent normally distributed rvs, so it is itself
normally distributed. Thus

$Z = \frac{(\bar{X} - X_{n+1}) - 0}{\sqrt{\sigma^2\left(1 + \frac{1}{n}\right)}} = \frac{\bar{X} - X_{n+1}}{\sigma\sqrt{1 + \frac{1}{n}}}$

has a standard normal distribution. As in the derivation of the distribution of $(\bar{X} - \mu)/(S/\sqrt{n})$
in Section 6.4, it can be shown (Exercise 42) that replacing $\sigma$ by the sample standard deviation $S$
(of $X_1, \ldots, X_n$) results in

$T = \frac{\bar{X} - X_{n+1}}{S\sqrt{1 + \frac{1}{n}}} \sim$ t distribution with $n - 1$ df

Manipulating this $T$ variable as in Expression (8.10) to isolate $X_{n+1}$ gives the following result.


PROPOSITION  A prediction interval (PI) for a single observation to be selected from a normal
population distribution is

$\bar{x} \pm t_{\alpha/2,n-1} \cdot s\sqrt{1 + \frac{1}{n}}$  (8.14)

The prediction level is $100(1 - \alpha)\%$.


The interpretation of a 95% prediction level is similar to that of a 95% confidence level: if the interval 
(8.14) is calculated for sample after sample, in the long run 95% of these intervals will include the 
corresponding future values of X. 


Example 8.12 (Example 8.11 continued) With $n = 25$, $\bar{x} = 570.9$, $s = 103.9$, and $t_{.025,24} = 2.064$, a
95% PI for the core wood density of a single tree in this Thai forest is

$\bar{x} \pm t_{.025,24} \cdot s\sqrt{1 + \frac{1}{n}} = 570.9 \pm (2.064)(103.9)\sqrt{1 + \frac{1}{25}} = 570.9 \pm 218.7 = (352.2,\ 789.6)$

So, with 95% certainty, we predict that the core wood density of an as-yet-unmeasured tree in this
forest will be between 352.2 and 789.6 g/mm³. This PI is quite wide (more than five times as wide as
the previous CI), a reflection of substantial variability in wood density in this forest. ■
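The contrast between the two intervals is easy to reproduce from the summary statistics quoted in Examples 8.11 and 8.12. This is a minimal sketch with illustrative variable names; the critical value $t_{.025,24} = 2.064$ is copied from the example rather than computed.

```python
from math import sqrt

n, xbar, s = 25, 570.9, 103.9  # summary statistics from Example 8.11
T_CRIT = 2.064                 # t_{.025,24}, copied from the example

ci_half_width = T_CRIT * s / sqrt(n)          # estimation error only
pi_half_width = T_CRIT * s * sqrt(1 + 1 / n)  # adds variability of the new observation

print(f"95% CI: ({xbar - ci_half_width:.1f}, {xbar + ci_half_width:.1f})")
print(f"95% PI: ({xbar - pi_half_width:.1f}, {xbar + pi_half_width:.1f})")
print(f"PI/CI width ratio: {pi_half_width / ci_half_width:.2f}")
```

The ratio of the two half-widths is $\sqrt{1 + 1/n}\,/\sqrt{1/n} = \sqrt{n + 1}$, which for $n = 25$ is about 5.1 and accounts for the "more than five times as wide" remark.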


It's worth contrasting the behavior of the one-sample t CI (8.11) with the PI (8.14). The PI is wider
than the CI because there is more variability in the prediction error (due to $X_{n+1}$) than in the estimation
error. In fact, as $n$ gets arbitrarily large, the CI shrinks to the single value $\mu$, while the PI approaches
$\mu \pm z_{\alpha/2} \cdot \sigma$, an interval that covers the middle $100(1 - \alpha)\%$ of a normal distribution. That's as it should
be: there is uncertainty about a single future $X$ value even when there is no need to estimate any
parameters.


Tolerance Intervals 

In addition to confidence intervals and prediction intervals, statisticians are sometimes called upon to 
obtain a third type of interval called a tolerance interval (TI). A TI is an interval that with a high 
degree of reliability captures a specified percentage of the population distribution. For example, if the 
population distribution of women's heights is normal, then the interval from $\mu - 1.645\sigma$ to
$\mu + 1.645\sigma$ captures 90% of the height values in the population of women. It can then be shown that
if $\mu$ and $\sigma$ are replaced by their natural estimates $\bar{x}$ and $s$ based on a sample of size $n = 20$ and the
z critical value 1.645 is replaced by a tolerance critical value 2.310, the resulting interval contains at
least 90% of the population values with a confidence level of 95%.

Please consult one of the references for more information on TIs. And before you calculate a 
particular statistical interval, be sure that it is the correct type of interval to fulfill your objective! 


Intervals Based on Nonnormal Population Distributions 

As mentioned previously, the one-sample $t$ CI for $\mu$ is robust to small or even moderate departures
from normality unless n is quite small. If, however, n is small and the population distribution is highly
nonnormal, then your actual confidence level may be considerably different from the one you think
you get from using a particular $t$ critical value. It would certainly be distressing to believe that your
confidence level is about 95% when in fact it was really more like 88% (or worse)! The bootstrap 
technique, discussed in the last section of this chapter, has been found to be quite successful at 
estimating parameters in a wide variety of nonnormal situations. 

In contrast to the confidence interval, the validity of the prediction interval described in this section 
is closely tied to the normality assumption. The prediction interval (8.14) should not be used in the 
absence of compelling evidence for normality. The excellent reference Statistical Intervals by Meeker 
et al., cited in the bibliography, discusses alternative procedures of this sort for various other 
situations. 


Exercises: Section 8.2 (13-42) 


13. Determine the values of the following quantities:
a. $t_{.1,15}$   b. $t_{.05,15}$   c. $t_{.05,25}$   d. $t_{.05,40}$   e. $t_{.005,40}$


14. Determine the $t$ critical value that will capture the desired $t$ curve area in each of
the following cases:
a. Central area = .95, df = 10
b. Central area = .95, df = 20
c. Central area = .99, df = 20
d. Central area = .99, df = 50
e. Upper-tail area = .01, df = 25
f. Lower-tail area = .025, df = 5


15. Determine the $t$ critical value for a two-sided confidence interval in each of the
following situations:
a. Confidence level = 95%, df = 10
b. Confidence level = 95%, df = 15
c. Confidence level = 99%, df = 15
d. Confidence level = 99%, n = 5
e. Confidence level = 98%, df = 24
f. Confidence level = 99%, n = 38


16. Determine the $t$ critical value for a lower or an upper confidence bound for each of the
situations described in the previous exercise.


17. Here are the alcohol percentages for a random sample of 16 beers (light beers excluded):

4.68 4.13 4.80 4.63 5.08 5.79 6.29 6.79
4.93 4.25 5.70 4.74 5.88 6.77 6.04 4.95

a. Construct a normal probability plot of the data. Is it plausible that these values
represent a sample from a normal distribution?
b. Calculate a 95% CI for the mean alcohol percentage of all nonlight beers.
c. Calculate a 95% CI for the mean amount of alcohol, in ounces, in a 12-oz. serving of
(again, nonlight) beer.


18. A random sample of 50 patients who had been seen at an outpatient clinic was selected, and
the waiting time to see a physician was determined for each one, resulting in a sample mean
time of 40.3 min and a sample standard deviation of 28.0 min (suggested by the article “An
Example of Good but Partially Successful OR Engagement: Improving Outpatient Clinic
Operations,” Interfaces 28, #5).

a. Calculate and interpret a 95% upper confidence bound for true average waiting time.
b. Based on the sample mean and standard deviation, why is it doubtful that the population
of waiting times is normally distributed? Does that invalidate the confidence bound you
calculated in part (a)?


19. Exercise 16 of Chapter 1 presented data on the noise level (dBA) experienced by a sample
of 77 individuals working at a particular office.

a. Construct a 95% confidence interval for the true average noise level experienced by
people working in this office.
b. Would it be feasible to construct a valid 95% PI for the noise level experienced by a
single office worker? Why or why not?


20. According to a study published in the Calgary Herald (Sep. 17, 2005), the average daily
commute time for workers in Calgary is 28.5 min with a standard deviation of 24.2 min. The
survey respondents constituted a random sample of 500 adults living in Calgary.

a. Construct and interpret a 99% CI for the true average daily commute time of all adults
living in Calgary.
b. Calculate a 99% CI for the average weekly commute time of this population.


21. An article in Issues in Accounting Education reported on the job-changing habits of
individuals who started at a Big Eight accounting firm. In a random sample of 44 such people
who subsequently changed jobs, the sample mean time to change was 35.02 months with a
standard deviation of 18.94 months.

a. Construct and interpret a 95% CI for the true average time to change jobs for this
population.
b. Construct a 95% PI for the time to change jobs for a randomly selected person starting
at a Big Eight accounting firm. Are there any extra assumptions you must make for this
interval to be valid? Do those assumptions seem credible here?


22. Frontier Airlines conducted a study of passenger weights, including carry-on items
(Alaska J. Commerce, May 25, 2003). They found an average summer weight of 183 lbs and an
average winter weight of 190 lbs. Suppose that both of these surveys were based on random
samples of 90 people and that the sample standard deviations for the summer and winter groups
were 25 and 28, respectively.

a. Construct and interpret a 95% CI for true average passenger weight (including carry-ons)
during the summer for Frontier Airlines.
b. Repeat part (a) for the winter sample.
c. Federal Aviation Administration (FAA) guidelines state that typical passenger weight
should be 190 lbs in the summer and 195 lbs in the winter. Based on the confidence intervals
in parts (a) and (b), do Frontier Airlines passengers appear to meet FAA recommendations?
Explain.


23. Consider the following sample of fat content (in percentage) of n = 10 randomly selected
hot dogs (“Sensory and Mechanical Assessment of the Quality of Frankfurters,” J. Texture
Stud. 1990: 395-409):

25.2 21.3 22.8 17.0 29.8 21.0 25.5 16.0 20.9 19.5

Assume that these were selected from a normal population distribution.

a. Compute a 95% confidence interval for the population mean fat content.
b. Would a 90% CI be wider or narrower than the interval you computed in (a)?
c. Would a 95% CI based on a sample of n = 20 hot dogs be wider or narrower than the interval
you computed in (a)?
d. Calculate a 95% PI for the fat content of a single hot dog.


24. Here is a sample of ACT scores (average of the Math, English, Social Science, and Natural
Science scores) for students taking college freshman calculus:

24.00 28.00 27.75 27.00 24.25 23.50 26.25
24.00 25.00 30.00 23.25 26.25 21.50 26.00
28.00 24.50 22.50 28.25 21.25 19.75

a. Using an appropriate graph, see if it is plausible that the observations were selected
from a normal distribution.
b. Calculate a two-sided 95% confidence interval for the population mean.
c. The university ACT average for entering freshmen that year was about 21. Are the calculus
students better than average, as measured by the ACT?


25. A sample of 14 joint specimens of a particular type gave a sample mean proportional limit
stress of 8.48 MPa and a sample standard deviation of .79 MPa (“Characterization of Bearing
Strength Factors in Pegged Timber Connections,” J. Struct. Engr. 1997: 326-332).

a. Calculate and interpret a 95% lower confidence bound for the true average proportional
limit stress of all such joints. What, if any, assumptions did you make about the
distribution of proportional limit stress?
b. Calculate and interpret a 95% lower prediction bound for the proportional limit stress of
a single joint of this type.


26. Even as traditional markets for sweetgum lumber have declined, large section solid
timbers traditionally used for bridge construction have become increasingly scarce. The
article “Development of Novel Industrial Laminated Planks from Sweetgum Lumber” (J. of
Bridge Engr. 2008: 64-66) described the manufacturing and testing of composite beams designed
to add value to low-grade sweetgum lumber. Here is data on the modulus of rupture (psi; the
article contained summary data expressed in MPa):

6807.99 7637.06 6663.28 6165.03 6991.41 6992.23
6981.46 7569.75 7437.88 6872.39 7663.18 6032.28
6906.04 6617.17 6984.12 7093.71 7659.50 7378.61
7295.54 6702.76 7440.17 8053.26 8284.75 7347.95
7422.69 7886.87 6316.67 7713.65 7503.33 7674.99

a. Verify the plausibility of assuming a normal population distribution.
b. Estimate the true average modulus of rupture in a way that conveys information about
precision and reliability.
c. Predict the modulus for a single beam in a way that conveys information about precision
and reliability. How does the resulting prediction compare to the estimate in (b)?


27. The n = 26 observations on escape time given in Exercise 46 of Chapter 1 give a sample
mean and sample standard deviation of 370.69 s and 24.36 s, respectively. Assume the
population distribution of escape times is at least approximately normal.

a. Calculate an upper confidence bound for population mean escape time using a confidence
level of 95%.
b. Calculate an upper prediction bound for the escape time of a single additional worker
using a prediction level of 95%. How does this bound compare with the confidence bound of
part (a)?
c. Suppose that two additional workers will be chosen to participate in the simulated escape
exercise. Denote their escape times by $X_{27}$ and $X_{28}$, and let $\bar{X}_{new}$ denote the average of
these two values. Modify the formula for a PI for a single x value to obtain a PI for
$\bar{X}_{new}$, and calculate a 95% two-sided interval based on the given escape data.


28. A study of the ability of individuals to walk in a straight line (“Can We Really Walk
Straight?” Amer. J. Phys. Anthropol. 1992: 19-27) reported the accompanying data on cadence
(strides per second) for a sample of n = 20 randomly selected healthy men.

.95 .85 .92 .95 .93 .86 1.00 .92 .85
.78 .93 .93 1.05 .93 1.06 1.06 .96 .81

A normal probability plot gives substantial support to the assumption that the population
distribution of cadence is approximately normal.

a. Calculate and interpret a 95% confidence interval for population mean cadence.
b. Calculate and interpret a 95% prediction interval for the cadence of a single individual
randomly selected from this population.


29. Return to the odometer reading scenario of Example 8.8. Calculate a prediction for an
additional Boxster’s odometer reading in a way that provides information about precision and
reliability. The authors actually selected a 17th such vehicle and found its odometer reading
to be 19,815. Is that consistent with your prediction?


30. Exercise 85 of Chapter 1 gave the following observations on a receptor binding measure
(adjusted distribution volume) for a sample of 13 healthy individuals: 23, 39, 40, 41, 43,
47, 51, 58, 63, 66, 67, 69, 72.

a. Is it plausible that the population distribution from which this sample was selected is
normal?
b. Predict the adjusted distribution volume of a single healthy individual by calculating a
95% prediction interval.


31. Here are the lengths (in minutes) of the 63 nine-inning games from the first week of the
2001 major league baseball season:

194 160 176 203 187 163 162 183 152 177
177 151 173 188 179 194 149 165 186 187
187 177 187 186 187 173 136 150 173 173
136 153 152 149 152 180 186 166 174 176
198 193 218 173 144 148 174 163 184 155
151 172 216 149 207 212 216 166 190 165
176 158 198

Assume that this is a random sample of nine-inning games (the mean differs by 12 s from the
mean for the whole season).

a. Give a 95% confidence interval for the population mean.
b. Give a 95% prediction interval for the length of the next nine-inning game. On the first
day of the next week, Boston beat Tampa Bay 3-0 in a nine-inning game of 152 min. Is this
within the prediction interval?
c. Compare the two intervals and explain why one is much wider than the other.
d. Explore the issue of normality for the data and explain how this is relevant to parts (a)
and (b).


32. A more extensive tabulation of $t$ critical values than what appears in this book shows
that for the $t$ distribution with 20 df, the areas to the right of the values .687, .860,
and 1.064 are .25, .20, and .15, respectively. What is the confidence level for each of the
following three confidence intervals for the mean $\mu$ of a normal population distribution?
Which of the three intervals would you recommend be used, and why?

a. $(\bar{x} - .687s/\sqrt{21},\ \bar{x} + 1.725s/\sqrt{21})$
b. $(\bar{x} - .860s/\sqrt{21},\ \bar{x} + 1.325s/\sqrt{21})$
c. $(\bar{x} - 1.064s/\sqrt{21},\ \bar{x} + 1.064s/\sqrt{21})$


33. The following data on distilled alcohol content (%) for a sample of 35 port wines was
extracted from the article “A Method for the Estimation of Alcohol in Fortified Wines Using
Hydrometer Baumé and Refractometer Brix” (Amer. J. Enol. Vitic. 2006: 486-490):

16.35 18.85 16.20 17.75 19.58 17.73 22.75 23.78 23.25
19.08 19.62 19.20 20.05 17.85 19.17 19.48 20.00 19.97
17.48 17.15 19.07 19.90 18.68 18.82 19.03 19.45 19.37
19.20 18.00 19.60 19.33 21.22 19.50 15.30 22.25

a. Calculate and interpret a 99% confidence interval for the population mean.
b. Calculate and interpret a 90% lower confidence bound for the population mean.
c. It would be of interest to winemakers to obtain a prediction interval for the alcohol
content of an individual port wine. Why should we hesitate to apply the PI formula (8.14) to
this data? [Hint: If you haven’t done so already, make a graph.]


34. The article “Evaluating Tunnel Kiln Performance” (Amer. Ceramic Soc. Bull., Aug. 1997:
59-63) gave the following summary information for fracture strengths (MPa) of n = 169 ceramic
bars fired in a particular kiln: $\bar{x}$ = 89.10, s = 3.73.

a. Calculate a (two-sided) confidence interval for true average fracture strength using a
confidence level of 95%. Does it appear that true average fracture strength has been
precisely estimated?
b. Suppose the investigators had believed a priori that the population standard deviation
was about 4 MPa. Based on this supposition, how large a sample would have been required to
estimate $\mu$ to within .5 MPa with 95% confidence?


35. As health care costs rise and health care systems worldwide become overburdened, patients
are made to wait longer for critical procedures. One Canadian study of 539 cardiac patients
waiting for cardiac bypass surgery found a mean wait time of 19 days with a standard
deviation of ten days (“Wait Times Data Guide,” Ministry of Health and Long-Term Care,
Ontario, Canada, 2006; wait time is measured from the date a patient was recommended for
surgery to the date surgery was performed). Assuming the data can be considered
representative of the Ontario population, construct a 90% CI for the true mean wait time for
bypass surgery in Ontario.


36. A sample of 66 obese adults was put on a low-carbohydrate diet for a year. The average
weight loss was 11 lb and the standard deviation was 19 lb. Calculate a 99% lower confidence
bound for the true average weight loss. What does the bound say about confidence that the
mean weight loss is positive?


37. A study was done on 41 first-year medical students to see if their anxiety levels changed
during the first semester. One measure used was the level of serum cortisol, which is
associated with stress. For each of the 41 students the level was compared during finals at
the end of the semester against the level in the first week of classes. The average
difference (end of semester minus beginning) was +2.08 with a standard deviation of 7.88.
Find a 95% lower confidence bound for the population mean difference $\mu$. Does the bound
suggest that the mean population stress change is necessarily positive?


38. The article “Ultimate Load Capacities of Expansion Anchor Bolts” (J. Energy Engr. 1993:
139-158) gave the following summary data on shear strength (kip) for a sample of 3/8-in.
anchor bolts: n = 78, $\bar{x}$ = 4.25, s = 1.30. Calculate a lower confidence bound using a
confidence level of 90% for true average shear strength.


39. University administrators wish to estimate the mean time to graduation, for the
population of students who actually graduate, to within ±3 months (one-quarter). It is known
that the maximum time to graduation is eight years (96 months) and the minimum time is three
years (36 months), so that a conservative estimate of the standard deviation of graduation
times is range/4 = (96 − 36)/4 = 15 months. Use this standard deviation estimate to determine
the sample size required to achieve the administrators’ goal with 95% confidence. [Note:
3 months is not the desired interval width; it’s the target margin of error.]


40. Young people may feel they are carrying the weight of the world on their shoulders, when
what they are actually carrying too often is an excessively heavy backpack. The article
“Effectiveness of a School-Based Backpack Health Promotion Program” (Work 2003: 113-123)
reported the following data for a sample of 131 sixth graders: for backpack weight (lb),
$\bar{x}$ = 13.83, s = 5.05; for backpack weight as a percentage of body weight, a 95% confidence
interval for the population mean was (13.62, 15.89).

a. Calculate and interpret a 99% CI for population mean backpack weight.
b. Obtain a 99% CI for population mean weight as a percentage of body weight.
c. The American Academy of Orthopedic Surgeons recommends that backpack weight be at most 10%
of body weight. What does your calculation of (b) suggest, and why?


41. Refer to the discussion below Expression (8.12) concerning one-sided large-sample
confidence bounds. Determine the confidence level for each of the following large-sample
one-sided confidence bounds:

a. Upper bound: $\bar{x} + .84s/\sqrt{n}$
b. Lower bound: $\bar{x} - 2.05s/\sqrt{n}$
c. Upper bound: $\bar{x} + .67s/\sqrt{n}$


42. Use the results of Sections 6.3-6.4 to show that the variable T on which the PI is based
does in fact have a $t$ distribution with n − 1 df.


8.3 Intervals for a Population Proportion 

The previous section focused primarily on interval estimates for a population mean, $\mu$. In this section,
we consider some methods for constructing a CI for a proportion. Let p denote the proportion of 
“successes” in a population: the proportion of all students at your university that graduate, the 
proportion of all production items that meet manufacturer specs, the proportion of all laptops that do 
not need warranty service, etc. 

A random sample of n individuals or objects will be selected, and X will denote the number of 
successes in the sample. Provided that n is small relative to the population size, X can be regarded as a 
Bin(n, p) random variable. Moreover, as discussed in Chapter 6 in connection with the Central Limit 
Theorem, if both np > 10 and n(1 — p) > 10, X has approximately a normal distribution. 

A natural estimator of the parameter p is the statistic $\hat{P} = X/n$, the sample proportion of successes.
As seen in Example 7.4, properties of X imply that

$$E(\hat{P}) = p \quad\text{and}\quad \sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}}$$

Also, since $\hat{P}$ is just X multiplied by the constant 1/n, $\hat{P}$ has an approximately normal distribution.
Standardizing $\hat{P}$ by subtracting its mean and dividing by its standard deviation then implies that

$$P\left(-z_{\alpha/2} < \frac{\hat{P}-p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}\right) \approx 1-\alpha$$


(This is only an approximate equality, because $\hat{P}$ is only approximately normal.) The standardized
version of $\hat{P}$ is also an approximate pivotal quantity. Proceeding as suggested in the subsection
“Deriving a General Confidence Interval” (Section 8.1), the confidence limits result from replacing
each < by = and solving for p. These equations are quadratic; sparing the reader the details, the two
roots are

$$p = \frac{\hat{P} + z_{\alpha/2}^2/(2n)}{1 + z_{\alpha/2}^2/n} \pm z_{\alpha/2}\,\frac{\sqrt{\hat{P}(1-\hat{P})/n + z_{\alpha/2}^2/(4n^2)}}{1 + z_{\alpha/2}^2/n}$$

PROPOSITION Let $\hat{p}$ denote the observed fraction of successes in a random sample of size
n from a population with true success proportion p. Then an approximate
$100(1-\alpha)\%$ confidence interval for p has endpoints

$$\tilde{p} \pm z_{\alpha/2}\,\frac{\sqrt{\hat{p}(1-\hat{p})/n + z_{\alpha/2}^2/(4n^2)}}{1 + z_{\alpha/2}^2/n} \qquad (8.15)$$

where $\tilde{p} = [\hat{p} + z_{\alpha/2}^2/(2n)]/[1 + z_{\alpha/2}^2/n]$. This is often referred to as the
score CI for p.


Example 8.13 Anyone using e-mail or surfing the web (these days, virtually everyone!) has 
encountered phishing, fraudulent e-mails or websites designed to look legitimate and thus trick people 
into revealing credit card numbers, passwords, etc. The article “Is Domain Highlighting Actually 
Helpful in Identifying Phishing Web Pages?” (Human Factors 2016: 640-660) describes a study in 
which 320 participants were shown webpages and asked to identify which were legitimate and which 
were fraudulent. In one phase of the study, 157 participants misidentified a phishing website as safe. 
Let p denote the proportion of all web users that would misidentify this fraudulent website under the 
study’s settings. A point estimate for p is the sample proportion $\hat{p} = 157/320 = .491$. Using the score
interval (8.15), an approximate 95% confidence interval for p is

$$\frac{.491 + (1.96)^2/(2\cdot 320)}{1 + (1.96)^2/320} \pm 1.96\,\frac{\sqrt{(.491)(.509)/320 + (1.96)^2/(4\cdot 320^2)}}{1 + (1.96)^2/320} = .491 \pm .054 = (.437,\ .545)$$

With 95% confidence, we conclude that between 43.7% and 54.5% of all web users would fall for this
fraudulent website under the study’s settings. 

The point of the article was to determine whether looking at a site’s URL in the browser’s address
bar can help people detect phishing sites. In the part of the study just described, participants could
not see the URL; in a second phase, participants were shown different websites along with their
addresses, and only 31.6% of them made the same mistake. Using (8.15) again, with 95% confidence
we infer that between 26.7% and 36.8% of all web users would mistakenly think a particular fraudulent
website was safe even if they could see its web address.
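The score interval (8.15) is easy to evaluate in code. The following is a minimal sketch using the counts from Example 8.13 (157 of 320); the function name `score_interval` is illustrative, and the default z = 1.96 assumes a 95% confidence level.

```python
import math

def score_interval(x, n, z=1.96):
    """Score CI (8.15) for a proportion: x successes in n trials."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom   # the shifted midpoint
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = score_interval(157, 320)
print(round(lo, 3), round(hi, 3))
```

Because the code carries the unrounded value 157/320 = .490625 through the calculation, the lower endpoint can differ from the hand calculation in the third decimal place; the example rounds the sample proportion to .491 before substituting.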


One-sided confidence bounds are available for p, and they are constructed in a similar fashion to
those discussed in Section 8.1. To obtain an approximate $100(1-\alpha)\%$ upper confidence bound for
p, simply replace $\pm$ with $+$ and $z_{\alpha/2}$ with $z_{\alpha}$ in (8.15). For a lower confidence bound, $\pm$ becomes a $-$
sign.


The “Traditional” Interval for p 

If the sample size n is very large, then the terms $z^2/(2n)$, $z^2/n$, and $z^2/(4n^2)$ in (8.15) are generally
quite negligible (small) compared to the other terms in the expression. Removing those “lesser” terms
simplifies the score interval to

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \qquad (8.16)$$


Expression (8.16), known as the Wald CI for p, is the one that for decades has appeared in intro- 
ductory statistics textbooks. It clearly has a much simpler and more appealing form than the score CI. 
So, why bother with (8.15)? 

Suppose we use $z_{.025} = 1.96$ in the Wald interval (8.16). Then our nominal confidence level (the one
we think we’re buying by using 1.96) is approximately 95%. So, before a sample is collected, the
chance that the random interval includes the actual value of p (i.e., the coverage probability) should
be about .95. But, as Figure 8.8 shows for the case n = 100, the actual coverage probability for this
interval can differ considerably from the nominal probability .95, particularly when p is not close to .5.
(The graph of coverage probability versus p is very jagged because the underlying binomial
distribution is discrete rather than continuous.) This is a serious deficiency of the Wald interval: the actual
confidence level can be considerably less than the nominal level, even for fairly large sample sizes.


[Figure: coverage probability (vertical axis) plotted against p (horizontal axis, 0 to 1)]

Figure 8.8 Actual coverage probability for the Wald interval (8.16) for varying values of p when n = 100

Research has shown that the score interval (8.15) rectifies this behavior: for virtually all sample
sizes and values of p, the actual confidence level will be quite close to the nominal level specified by
the choice of $z_{\alpha/2}$. This is due largely to the fact that the score interval is shifted a bit toward .5
compared to the Wald interval. In particular, the midpoint $\tilde{p}$ of the score interval is always a bit closer
to .5 than is the midpoint $\hat{p}$ of the Wald interval; this is especially important when p is close to 0 or 1.
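Coverage probabilities of this kind can be computed exactly rather than simulated, because X takes only the values 0, 1, ..., n. The sketch below sums the binomial probabilities of every x whose interval contains the true p; the function names are illustrative, and n = 100 with z = 1.96 is assumed to match the setting described above.

```python
import math

def wald_interval(x, n, z=1.96):
    """Wald CI (8.16)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def score_interval(x, n, z=1.96):
    """Score CI (8.15)."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def coverage(interval, n, p, z=1.96):
    """Exact probability that the interval built from X ~ Bin(n, p) contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, z)
        if lo <= p <= hi:
            total += math.comb(n, x) * p**x * (1 - p)**(n - x)
    return total

for p in (0.05, 0.20, 0.50):
    print(p, round(coverage(wald_interval, 100, p), 3),
          round(coverage(score_interval, 100, p), 3))
```

Evaluating `coverage` over a fine grid of p values would trace out a jagged curve of the kind the figure caption describes.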


Sample Size Determination 
Equating the width of the CI for p to a prespecified width w gives a quadratic equation for the sample
size n necessary to give an interval with a desired degree of precision. The solution is

$$n = \frac{z_{\alpha/2}^2\left[2\hat{p}\hat{q} - w^2 + \sqrt{(2\hat{p}\hat{q})^2 + (1-4\hat{p}\hat{q})w^2}\right]}{w^2} \qquad (8.17)$$

where $\hat{q} = 1 - \hat{p}$. Neglecting the terms in the numerator involving $w^2$, which will be quite small
because w is a (typically small) decimal value, gives

$$n = \frac{4z_{\alpha/2}^2\,\hat{p}\hat{q}}{w^2}$$


This latter expression is what results from equating the width of the Wald interval to w. 

These formulas unfortunately include $\hat{p}$, which a researcher does not know in advance of the study
when trying to determine what sample size is required. The most conservative approach is to use the
fact that both expressions are maximized when $\hat{p} = \hat{q} = .5$. Thus if $\hat{p}\hat{q} = (.5)(.5) = .25$ is used in
(8.17), the width of the CI will be at most w regardless of what value of $\hat{p}$ results from the sample.
Alternatively, if the investigator has a rough estimate $p_0$ from some prior study, $p_0$ can be used in
place of $\hat{p}$.


Example 8.14 Using the conservative method $\hat{p} = \hat{q} = .5$ proposed above, the sample size formula
(8.17) simplifies to

$$n = \frac{z_{\alpha/2}^2\left[2(.25) - w^2 + \sqrt{(2\cdot .25)^2 + (1-4(.25))w^2}\right]}{w^2} = \frac{z_{\alpha/2}^2\left[.5 - w^2 + \sqrt{.25}\right]}{w^2} = \frac{z_{\alpha/2}^2(1-w^2)}{w^2}$$

The width of the 95% CI in Example 8.13 is .108. The value of n necessary to ensure a width of no
more than .08, irrespective of the value of $\hat{p}$, is

$$n = \frac{1.96^2(1-.08^2)}{.08^2} = 596.4$$

Thus, a sample size of 597 should be used. The expression for n based on the Wald CI gives a slightly
larger value of 601.
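The arithmetic in Example 8.14 can be reproduced with a short sketch. The function names `n_score` and `n_wald` are illustrative; they implement (8.17) and its simplified Wald counterpart, with the conservative choice $\hat{p} = .5$ and z = 1.96 as defaults.

```python
import math

def n_score(w, p=0.5, z=1.96):
    """Sample size from (8.17): score interval of width w at sample proportion p."""
    q = 1 - p
    numerator = 2 * p * q - w**2 + math.sqrt((2 * p * q)**2 + (1 - 4 * p * q) * w**2)
    return z**2 * numerator / w**2

def n_wald(w, p=0.5, z=1.96):
    """Sample size from equating the Wald interval width to w."""
    return 4 * z**2 * p * (1 - p) / w**2

# Round up, since n must be a whole number at least as large as the solution.
print(math.ceil(n_score(0.08)), math.ceil(n_wald(0.08)))  # 597 601
```

Passing a prior estimate as `p` (for example `n_score(0.08, p=0.3)`) gives the less conservative sample size discussed just before the example.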


Alternative Intervals for p 

Even the score interval (8.15) is not perfect: because it relies on a normal approximation to the
binomial, it isn’t necessarily suited to situations where that approximation is poor (although it fares
much better than the Wald interval!). Interval estimation methods exist that do not rely on this normal
approximation and, hence, are reliable for all values of n and p. The so-called exact method, based on
the binomial distribution, is guaranteed to produce an interval having coverage probability at least as
great as the nominal confidence level. But the exact method tends to produce CIs that are very wide,
an undesirable property for a confidence interval.

These issues have been largely resolved by research reported in the 2014 article “A Coverage 
Probability Approach to Finding an Optimal Binomial Confidence Procedure” (The Amer. Stat., 
Schilling and Doi). The article describes a computer-intensive method for determining what the authors 
call the length/coverage optimal interval for p. That is, for any given sample data, their method 
produces an interval that (1) has coverage probability at least as great as the specified confidence level, 
regardless of the value of p, and (2) is the shortest among all such intervals. There is no explicit 
“formula” for the length/coverage optimal interval, but the article’s authors have created online software 
to compute the interval automatically. This can be accessed at http://shiny.stat.calpoly.edu/LCO-CI/. 


Example 8.15 The Super Bowl is one of the most watched events in the country, but not everyone 
watches it. Imagine that in a sample of 25 students at a university, all 25 said they watched the most 
recent Super Bowl. What can be said about the parameter p = the proportion of all students at this 
university who watched the game? Though $\hat{p} = 25/25 = 1$, it seems unrealistic to infer that 100% of
all students saw the most recent Super Bowl.

The Wald interval clearly should not be applied here; doing so results in a CI of $1 \pm 0 = 1$,
suggesting p is known to be 100% exactly (which, again, is just silly). Using the Schilling-Doi website,
with n = 25 and x = 25 the length/coverage optimal 95% confidence interval for p is (.866, 1). That is,
we are 95% confident that between 86.6% and 100% of all students at this university watched the most
recent Super Bowl. The score interval based on this data is nearly identical, (.867, 1), suggesting that
even for small sample sizes the score interval is a wise choice.
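The degenerate Wald result and the score interval quoted in Example 8.15 can both be verified directly. This is a small sketch with illustrative function names; it does not attempt to reproduce the length/coverage optimal interval, which has no explicit formula.

```python
import math

def wald_interval(x, n, z=1.96):
    """Wald CI (8.16); collapses to a single point when x = 0 or x = n."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def score_interval(x, n, z=1.96):
    """Score CI (8.15)."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wald_interval(25, 25))        # (1.0, 1.0): zero width
lo, hi = score_interval(25, 25)
print(round(lo, 3), round(hi, 3))   # 0.867 1.0
```

When x = n the score interval's upper endpoint is exactly 1 algebraically, so only the lower endpoint carries information.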


Exercises: Section 8.3 (43-56) 


43. According to Oklahoma State University’s 2015 Food Demand Survey, 859 of 1044 randomly
selected adults support mandatory labels on foods produced with genetic engineering
(popularly called GMO products). Construct an approximate 95% confidence interval for the
proportion of all adults who support mandatory labeling of GMO products.


44. In a survey of 1100 drivers, 90% admitted to careless or aggressive driving during the
previous six months (“Nine out of Ten Drivers Admit in Survey to Having Done Something
Dangerous,” Knight Ridder Newspapers, Jul. 8, 2005). Assuming these 1100 drivers may be
treated as a random sample of all drivers in the USA, construct and interpret a 95% CI for
the true proportion of drivers who have engaged in “dangerous” driving in the past six
months.


45. A June, 2019 Gallup survey of 1018 randomly selected US adults found that 53% supported
the government sponsoring a manned mission to Mars.

a. Construct and interpret a 95% lower confidence bound for the proportion of all US adults
who support such a mission.
b. Does your answer to part (a) clearly indicate that a majority of all US adults feel this
way (at least as of June, 2019)?


46. The article “Teens and Young Adults Embrace Online Multiplayer and Competitive Video
Games” (Washington Post, Apr. 3, 2018) reported that, in a survey of 522 Americans age 14-21,
38% said they consider themselves a fan of competitive gaming. Construct a 99% confidence
interval for the proportion of all Americans in this age group that are fans of esports or
competitive gaming.


47. As reported by CNBC (Dec. 11, 2018), 57% of people surveyed admitted to shopping
online while at work. The survey was based on a random sample of n = 2020 US
adults. Construct and interpret a 90% upper confidence bound for the proportion of all
US adults who shop online while at work.

48. The article “Repeatability and Reproducibility for Pass/Fail Data” (J. Testing
Eval. 1997: 151-153) reported that in n = 48 trials in a particular laboratory, 16
resulted in ignition of a particular type of substrate by a lighted cigarette. Let p denote
the long-run proportion of all such trials that would result in ignition.

a. Use the score interval (8.15) to construct a 95% CI for p.
b. Use the Wald interval (8.16) to construct a 95% CI for p.
c. How do the intervals in parts (a) and (b) compare? Is the narrower interval
preferable here? Why or why not?

49. The article “Limited Yield Estimation for Visual Defect Sources” (IEEE Trans.
Semicon. Manuf. 1997: 17-23) reported that, in a study of a particular wafer
inspection process, 356 dies were examined by an inspection probe and 201 of these
passed the probe. Assuming a stable process, calculate a 95% (two-sided) confidence
interval for the proportion of all dies that pass the probe.

50. There is increasing concern within the
health sciences community over the use of 
electronic tobacco products. The article 
“Exposure to Tobacco and Nicotine Pro- 
duct Advertising: Associations with Per- 
ceived Prevalence of Use Among College 
Students” (Amer. J. College Health, Vol- 
ume 66, 2018, Issue 8) reported on a study 
based on the Texas College Tobacco Pro- 
ject survey administered in 2016. In a 
sample of 5767 undergraduates ages 18-25, 
9.1% said they had used electronic cigar- 
ettes at least once during the previous 
30 days. Calculate and interpret a confi- 
dence interval using a 99% confidence level 
for the proportion of all students in the population sampled who used e-cigarettes
during the previous 30 days.

51. The article “Broad Agreement on Most Ideas to Curb School Shootings” from the
Gallup.com website reported on a survey carried out from March 5-11, 2018. There
was overwhelming support for more training of school and security personnel, and
for background checks for all gun sales. However, opinion was more evenly split on
the issue of arming teachers; 42% of the 1515 adults in the sample favored providing
teachers with weapons. Using a 95% confidence level, calculate an upper confidence
bound for the percentage of all adults in the USA who favor arming teachers.
Based on your interval, can you be confident that a majority of the population does
not favor such a policy?

52. In a sample of 1000 randomly selected consumers who had opportunities to send in
a rebate claim form after purchasing a product, 250 of these people said they never did
so (“Rebates: Get What You Deserve,” Consumer Reports, May 2009: 7). Reasons
cited for their behavior included too many steps in the process, amount too small,
missed deadline, fear of being placed on a mailing list, lost receipt, and doubts about
receiving the money. Calculate an upper confidence bound at the 95% confidence
level for the true proportion of such consumers who never apply for a rebate. Based
on this bound, is there compelling evidence that the true proportion of such consumers is
smaller than 1/3? Explain your reasoning.

53. A state legislator wishes to survey residents of her district to see what proportion of the
electorate is aware of her position on using state funds to pay for abortions.

a. What sample size is necessary if the 95% CI for p is to have width of at most
.10 irrespective of p?
b. If the legislator has strong reason to believe that at least 2/3 of the electorate
know of her position, how large a sample size would you recommend?

54. A mortgage company wishes to estimate the proportion of all borrowers who default
on their home loans, to within a margin of error of 2 percentage points (±.02). What
sample size is required to achieve this at the 90% confidence level? How does the
answer change if it is believed initially that roughly 15% of all customers default?

55. A recent student project asked students at Cal Poly (where one of the authors teaches)
whether they have ever been arrested on alcohol- or drug-related charges, including
drunk driving. Out of 57 students surveyed in the College of Business, only four reported
they had been arrested (the surveys were made anonymous to encourage truthful
responses). Assuming these 57 students comprise a random sample of all business
students at the school, and assuming students answered truthfully, estimate with 95%
confidence the proportion of all business students who have been arrested on such charges.

56. Reconsider the score CI (8.15) for p, and focus on a confidence level of 95%. Show
that the confidence limits agree quite well with those of the Wald interval (8.16) once
two successes and two failures have been appended to the sample, i.e., (8.16) based
on (x + 2) S’s in (n + 4) trials. [Hint: 1.96 ≈ 2.]


8.4 Confidence Intervals for the Population Variance and Standard 
Deviation 


Although inferences concerning a population variance \(\sigma^2\) or standard deviation \(\sigma\) are usually of less
interest than those about a mean or proportion, there are occasions when such procedures are needed.
In the case of a normal population distribution, inferences are based on a result from Section 6.4
concerning the sample variance \(S^2\): if \(X_1, \ldots, X_n\) is a random sample from a normal distribution with
variance \(\sigma^2\), then the random variable

\[\frac{(n-1)S^2}{\sigma^2} \tag{8.18}\]

has a chi-squared (\(\chi^2\)) distribution with n − 1 df.

As discussed in Section 6.3, the chi-squared distribution is a continuous probability distribution
with a single parameter \(\nu\), the number of degrees of freedom. To specify inferential procedures that
use the chi-squared distribution, recall the notation for critical values from Section 6.3.


NOTATION Let \(\chi^2_{\alpha,\nu}\), called a chi-squared critical value, denote the number on the measurement
axis such that \(\alpha\) of the area under the chi-squared curve with \(\nu\) df lies to the right of \(\chi^2_{\alpha,\nu}\).
See Figure 8.9a.


It was necessary to tabulate only upper-tail critical values for the t distribution (\(t_{\alpha,\nu}\) for small values of
\(\alpha\)), because t density curves are symmetric. But chi-squared distributions are not symmetric, so
Appendix Table A.5 contains values of \(\chi^2_{\alpha,\nu}\) for \(\alpha\) both near 0 and near 1, as illustrated in Figure 8.9b.
For example, \(\chi^2_{.025,14} = 26.119\) and \(\chi^2_{.95,20}\) (the 5th percentile) = 10.851.

Figure 8.9 \(\chi^2_{\alpha,\nu}\) notation illustrated: (a) \(\chi^2_\nu\) pdf with shaded area = \(\alpha\); (b) each shaded area = .01


The rv \((n-1)S^2/\sigma^2\) in (8.18) satisfies the two properties of being a pivotal quantity: it is a function
of the parameter of interest \(\sigma\), yet its probability distribution, \(\chi^2_{n-1}\), does not depend on this parameter.
So, the methods described in Section 8.1 can be applied to this rv in order to construct a confidence
interval for \(\sigma\). Analogous to Figure 8.9b, the area under a \(\chi^2_{n-1}\) curve to the right of \(\chi^2_{\alpha/2,n-1}\) is \(\alpha/2\), as
is the area to the left of \(\chi^2_{1-\alpha/2,n-1}\). Thus the area captured between these two critical values is \(1-\alpha\),
from which we may infer

\[P\left(\chi^2_{1-\alpha/2,n-1} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{\alpha/2,n-1}\right) = 1-\alpha \tag{8.19}\]

The inequalities in (8.19) are equivalent to

\[\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}\]

Substituting the computed value \(s^2\) into the limits gives a CI for \(\sigma^2\), and taking square roots gives an
interval for \(\sigma\).


PROPOSITION A 100(1 − α)% confidence interval for the variance \(\sigma^2\) of a normal population
is given by the endpoints

\[\left(\frac{n-1}{\chi^2_{\alpha/2,n-1}}\, s^2,\ \ \frac{n-1}{\chi^2_{1-\alpha/2,n-1}}\, s^2\right) \tag{8.20}\]

A 100(1 − α)% confidence interval for \(\sigma\) has lower and upper limits that are the
square roots of the corresponding limits in (8.20).

An upper confidence bound for \(\sigma^2\) is obtained from the right endpoint of (8.20)
by substituting \(\alpha\) for \(\alpha/2\) in the \(\chi^2\) critical value; taking the square root of that
quantity results in an upper confidence bound for \(\sigma\). The left endpoint of (8.20)
can be modified similarly to achieve lower confidence bounds.


Recall from Section 6.3 that the expected value of a chi-squared rv is its df; here, df = n − 1. As a
result, the upper critical value \(\chi^2_{\alpha/2,n-1}\) should exceed n − 1, and so the fraction appearing in the left
endpoint of (8.20) should be less than 1. Similarly, the fraction in the right endpoint should be greater
than 1, so that the interval’s endpoints lie on either side of the point estimate \(s^2\) (albeit not equidistant
from \(s^2\)).

Example 8.16 The report “Strand Debonding for Pretensioned Girders” (NCHRP Research Report
849: 2017) includes the following yield strength measurements (ksi) for a sample of 16 reinforcing
bars of the type commonly used in bridges:


67.6 63.6 69.2 82.1 69.7 79.0 67.1 65.9 
79.1 65.4 70.5 75.6 72.7 63.6 70.1 75.6 


Figure 8.10 shows a normal probability plot, which indicates that a normal model for the population 
of yield strength measurements is plausible. 


Figure 8.10 Normal probability plot of the yield strength data in Example 8.16 (horizontal axis: yield strength (ksi); vertical axis: percent)


Let \(\sigma\) denote the true standard deviation of the yield strength distribution. The computed value of
the sample sd is s = 5.47; this is a point estimate of \(\sigma\). With df = 16 − 1 = 15, a 95% CI requires the
critical values \(\chi^2_{.025,15} = 27.488\) and \(\chi^2_{.975,15} = 6.262\). The resulting interval for \(\sigma^2\) is

\[\left(\frac{16-1}{27.488}(5.47)^2,\ \ \frac{16-1}{6.262}(5.47)^2\right) = (16.32, 71.67)\]

Taking the square root of the endpoints yields (4.04, 8.47) as the 95% CI for \(\sigma\). At the 95%
confidence level, the true standard deviation of yield strength for this type of bridge reinforcing bar is
between 4.04 and 8.47 ksi. ■
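
The arithmetic behind (8.20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `variance_ci` is hypothetical, the critical values are typed in from Appendix Table A.5 rather than computed, and s = 5.47 is the rounded value quoted in Example 8.16, so the last digit of an endpoint may differ slightly from the text.

```python
import math

def variance_ci(n, s, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2 based on (8.20).

    chi2_upper is the critical value chi^2_{alpha/2, n-1} and
    chi2_lower is chi^2_{1-alpha/2, n-1}; both are read from a table.
    """
    df = n - 1
    lower = df * s**2 / chi2_upper  # larger critical value gives the lower limit
    upper = df * s**2 / chi2_lower  # smaller critical value gives the upper limit
    return lower, upper

# Values quoted in Example 8.16: n = 16, s = 5.47, 95% confidence, 15 df
var_lo, var_hi = variance_ci(n=16, s=5.47, chi2_upper=27.488, chi2_lower=6.262)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)
print(f"CI for the variance: ({var_lo:.2f}, {var_hi:.2f})")
print(f"CI for the sd:       ({sd_lo:.2f}, {sd_hi:.2f})")
```

Passing the critical values in as arguments keeps the sketch free of third-party libraries; a statistics package could supply them directly instead.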


The confidence interval illustrated in the preceding example relies heavily on the normality 
assumption. Research has shown that using (8.20) with data from nonnormal populations can result in 
highly unreliable intervals, even when the sample size n is large. (For example, the coverage prob- 
ability of an ostensible 95% CI can be far less than .95.) A more robust method is presented in the 
article “Approximate Confidence Interval for Standard Deviation of Nonnormal Distributions” 
(Comp. Stat. & Data Analysis 2006: 775-782) by D. Bonett. This interval, though typically wider 
than (8.20), has been shown to achieve much better coverage probability for a wide variety of 
nonnormal population distributions. It is now incorporated into the Minitab software package, which
produces (8.20) and Bonett’s interval when users request a CI for population variance. See Exercise
92 for more information. 

Alternatively, the bootstrap method presented in the next section can produce an interval estimate 
for population standard deviation (or variance) without requiring population normality. 


Exercises: Section 8.4 (57-62) 


57. Determine the values of the following quantities:

Be Hous, De 4 ge © Vo ax Ge Fogg
e. 0585s

58. Determine the following:

a. The 95th percentile of the \(\chi^2_{10}\) distribution
b. The 5th percentile of the \(\chi^2_{10}\) distribution
c. P(10.98 < Y < 36.78), where Y is a \(\chi^2_{22}\) rv
d. P(Y < 14.611 or Y > 37.652), where Y is a \(\chi^2_{25}\) rv

59. Exercise 17 provided alcohol percentage data for a sample of 16 beers. The sample
standard deviation of those measurements was s = .8483. Construct a 90% CI for the
population variance \(\sigma^2\) of alcohol percentage in beers, and then a 90% CI for \(\sigma\).

60. Exercise 24 gave a random sample of 20 ACT scores from students taking college
freshman calculus. Calculate a 99% CI for the standard deviation of the population
distribution. Is this interval valid whatever the nature of the distribution? Explain.

61. Here are the names of 12 orchestra conductors and their performance times in
minutes for Beethoven’s Ninth Symphony:

Bernstein   71.03    Furtwangler  74.38
Leinsdorf   65.78    Ormandy      64.72
Solti       74.70    Szell        66.22
Bohm        72.68    Karajan      66.90
Masur       69.45    Rattle       69.93
Steinberg   68.62    Tennstedt    68.40

a. Check to see that normality is a reasonable assumption for the performance
time distribution.
b. Compute a 95% CI for the population standard deviation, and interpret the
interval.
c. Supposedly, classical music is 100% determined by the composer’s notation,
including all timings. Based on your results, is this true or false?

62. Refer to the baseball game times in Exercise 31. Calculate an upper confidence
bound with confidence level 95% for the population standard deviation of game
time. Interpret your interval. Explore the issue of normality for the data and explain
how this is relevant to your interval.


8.5 Bootstrap Confidence Intervals 


How can a confidence interval for the mean be constructed if the population distribution is not normal 
and the sample size n is small? Can we find confidence intervals for other parameters, such as the 
population median or the 90th percentile of the population distribution? The bootstrap, developed by 
Bradley Efron in the late 1970s, facilitates calculating estimates in situations where statistical theory 
does not produce a formula for a confidence interval. The method substitutes heavy computation for 
theory, and many statistical software packages now implement various bootstrap methods (this 
includes SAS, R, JMP Pro, and Minitab). The parametric bootstrap, for applications with a known 
(or assumed) population distribution, was briefly mentioned in Section 7.1. In this section we are 
concerned with the case of an unknown distribution, for which the nonparametric bootstrap is 
appropriate.


The Bootstrap Method

Traditional inference (e.g., the presentation in Sections 8.1—8.4) relies on the sampling distribution of 
a statistic—the distribution of the values of that statistic if we were to hypothetically take all possible 
random samples of size n from the parent population. This is illustrated for the sample mean in 
Figure 8.11a. In contrast, the bootstrap method considers what would happen if we were to draw 
repeatedly from the sample at hand. For example, if we had n = 15 observations from some population
and we were interested in drawing inferences about the mean, the bootstrap distribution of \(\bar{X}\)
would consist of all \(\bar{x}^*\) values that could be obtained by taking a random sample of size 15 (called a
bootstrap sample or resample) from the original 15 observations. Obviously, for that to make sense,
bootstrap sampling must occur with replacement; otherwise, we would get the same sample over and
over again. Figure 8.11b diagrams the basic bootstrap method.


Figure 8.11 Two versions of sampling variability: (a) creating the sampling distribution of \(\bar{X}\); (b) creating the
bootstrap distribution of \(\bar{X}\)


Philosophically, the bootstrap method treats the sample at hand as if it were the population, since 
the sample represents in a sense everything the user knows about the underlying population. Again, 
the advantage of bootstrapping is that the method applies in the absence of theory (e.g., the CLT) or 
distributional requirements (e.g., normality). The steps in the basic bootstrap method are as follows. 


BASIC BOOTSTRAP METHOD Suppose we wish to generate the bootstrap distribution of a statistic \(\hat{\theta}\) based
upon an observed sample \(x_1, x_2, \ldots, x_n\) from some population.

1. Take a random sample of size n with replacement from \(x_1, x_2, \ldots, x_n\),
   resulting in \(x_1^*, x_2^*, \ldots, x_n^*\).
2. Compute the value of the statistic \(\hat{\theta}\) from this bootstrap sample; label the
   resulting value \(\hat{\theta}^*\).
3. Repeat steps 1 and 2 a large number of times (say, B times), giving values
   \(\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*\) for the statistic of interest.

These values \(\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*\) approximate the bootstrap distribution of \(\hat{\theta}\).


We use the word “approximate” above only because the process terminates after obtaining B resamples
and B resulting values \(\hat{\theta}^*\). The complete bootstrap distribution of \(\hat{\theta}\) consists of all \(\hat{\theta}^*\) values
from all possible bootstrap samples, but the number of such samples can be unwieldy for even
moderate sample sizes. It can be shown that the number of bootstrap samples from an original sample
of size n is \(\binom{2n-1}{n}\); even for n = 15, this is more than 77 million, and the number of possible
bootstrap samples increases rapidly with n. In practice, B = 1000 is often used.

It has been shown experimentally that the bootstrap distribution of a statistic quite often resembles
the actual sampling distribution of that statistic. In particular, the standard error of a statistic \(\hat{\theta}\), \(\sigma_{\hat{\theta}}\), can
often be well approximated by its bootstrap standard error, defined to be the sample standard
deviation of the \(\hat{\theta}^*\)’s:

\[s_{boot} = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(\hat{\theta}_i^* - \bar{\theta}^*\right)^2} \tag{8.21}\]

The symbol \(\bar{\theta}^*\) in (8.21) denotes the mean of the bootstrap values of \(\hat{\theta}\), i.e., \(\bar{\theta}^* = \sum \hat{\theta}_i^*/B\). In the
bootstrap literature, B is sometimes used in place of B − 1 in (8.21); for typical values of B, there is
usually little difference between the resulting estimates.
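
The three steps of the basic bootstrap method and the standard error in (8.21) translate almost directly into code. The following is a minimal sketch using only the Python standard library; the function names, the seed, and the eight-value sample are hypothetical and serve only to make the example runnable.

```python
import math
import random
import statistics

def bootstrap_distribution(sample, statistic, B=1000, seed=None):
    """Return B bootstrap values of `statistic` (steps 1-3 of the basic method)."""
    rng = random.Random(seed)
    n = len(sample)
    values = []
    for _ in range(B):
        resample = rng.choices(sample, k=n)  # step 1: size n, with replacement
        values.append(statistic(resample))   # step 2: statistic of the resample
    return values                            # step 3: B values in total

def bootstrap_se(boot_values):
    """Bootstrap standard error (8.21): sample sd of the values, divisor B - 1."""
    return statistics.stdev(boot_values)

def number_of_bootstrap_samples(n):
    """Count of possible bootstrap samples from n observations: C(2n - 1, n)."""
    return math.comb(2 * n - 1, n)

# Hypothetical sample, for illustration only
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 7.3, 5.0]
boot_means = bootstrap_distribution(sample, statistics.mean, B=1000, seed=1)
print("bootstrap SE of the mean:", round(bootstrap_se(boot_means), 3))
print("possible bootstrap samples for n = 15:", number_of_bootstrap_samples(15))
```

Because the statistic is passed in as a function, the same sketch works for a median or a trimmed mean by swapping the second argument.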


Example 8.17 In a student project, Erich Brandt studied tips at a restaurant. Here is a random 
sample of 30 observed tip percentages: 


22.7 16.3 13.6 16.8 29.9 15.9 14.0 15.0 14.1 18.1 22.8 27.6 16.4 16.1 19.0 
13.5 18.9 20.2 19.7 18.2 15.4 15.7 19.0 11.5 18.4 16.0 16.9 12.0 40.1 19.2 


We would like to get a confidence interval for \(\mu\), the population mean tip percentage at this
restaurant. However, this is not a very large sample and there is a problem with positive skewness, as
shown in the normal probability plot of Figure 8.12. Most of the tips are between 10 and 20%, but a
few big tips cause enough skewness to invalidate the normality assumption. The one-sample t interval
applied to this data would not be trustworthy.


Figure 8.12 Normal probability plot (from Minitab) of the tip percentages (Mean 18.43, StDev 5.761, N 30)

To implement the bootstrap method here, regard the 30 observations as constituting a population.
Then take a large number of random resamples with replacement, each of size 30, from this “population.”
For each of these resamples compute the sample mean (because the population mean is the
parameter of interest). Then use the distribution of these resample means to get a confidence interval
for the population mean (we’ll explain how shortly). To help get a feeling for how this works, here is
the first of B = 1000 resamples generated using software:


22.8 16.8 16.0 19.0 19.2 20.2 13.6 15.9 22.8 11.5 15.9 14.0 29.9 19.2 16.0 
27.6 14.1 13.5 16.8 15.4 20.2 16.4 20.2 16.9 16.8 22.8 19.7 18.2 22.7 18.2 


That is, \(x_1^* = 22.8, x_2^* = 16.8, \ldots, x_{30}^* = 18.2\) for this bootstrap sample. Notice that some values from
the original sample are repeated (due to sampling with replacement), while some values don’t appear
at all. This first bootstrap sample has mean \(\bar{x}_1^* = 18.41\); the asterisk emphasizes that this is the mean
of a bootstrap resample and not of the original sample of 30 tip percentages. This process was
repeated 1000 times, resulting in resample means \(\bar{x}_1^*, \ldots, \bar{x}_{1000}^*\). Figure 8.13 displays a histogram of
these 1000 \(\bar{x}^*\) values, the approximate bootstrap distribution of the statistic \(\bar{X}\). Notice that the
bootstrap distribution of \(\bar{X}\) is somewhat right-skewed, inconsistent with a normal distribution.

Figure 8.13 Histogram of the bootstrap distribution of \(\bar{X}\) for Example 8.17 ■


In Example 8.17, the mean for the original 30 observations is \(\bar{x} = 18.43\). On the other hand, the
mean of the bootstrap distribution displayed in Figure 8.13 is 18.416. This is due to only taking 1000
bootstrap samples; it can be shown that the complete bootstrap distribution of \(\bar{X}\) will always be
centered at the mean of the original sample. However, this is not the case for other statistics. For
example, the mean value of the bootstrap distribution of a trimmed mean is not necessarily the “true”
value of \(\bar{x}_{tr}\) (i.e., the trimmed mean of the original sample). The bias of a bootstrap distribution is
defined to be the difference between these two values. In practice, if this bias is small relative to the
magnitude of the data itself, there is little cause for concern.

Example 8.18 As data proliferates across every business sector, data base administrator (DBA) has
become an increasingly lucrative career choice. Figure 8.14a shows a histogram of the salaries for
115 DBAs with 0-2 years of experience (“The 2019 Data Professional Salary Survey Results,”
www.brentozar.com). Because some salaries are unusually high (both in the sample and the population), a
10% trimmed mean might be considered an appropriate measure of center. To estimate the population
trimmed mean, we must first understand the variability of the statistic \(\bar{X}_{tr}\); the bootstrap method is
appropriate because a theoretical description of the sampling distribution of trimmed means is not
available. Figure 8.14b shows the bootstrap distribution of \(\bar{X}_{tr}\) based on B = 1000 resamples.
Interestingly, this bootstrap distribution appears to be approximately normal.

Figure 8.14 Graphs for Example 8.18: (a) histogram of 115 salaries; (b) bootstrap distribution of \(\bar{X}_{tr}\) from this sample

The 10% trimmed mean for the original 115 salaries is \(\bar{x}_{tr}\) = $86,320, while the center (i.e., mean)
value of the associated bootstrap distribution is $86,422. The bias of −$102 in this simulated bootstrap
distribution is small relative to the salaries themselves, suggesting the bootstrap distribution will
be a reasonable tool for inference on the population trimmed mean. ■


Bootstrap Interval Estimation 

Once we have the bootstrap distribution of a statistic, several different methods can be used to obtain
a confidence interval for the corresponding parameter. If, as in Example 8.18, the bootstrap distribution
of the statistic appears reasonably bell-shaped, then a variation on the t interval from Section 8.2
may be employed. Recall that the t interval for a mean \(\mu\), assuming a normal sampling
distribution for \(\bar{X}\), is \(\bar{x} \pm t_{\alpha/2,n-1} \cdot s/\sqrt{n}\); the \(s/\sqrt{n}\) term is the estimated standard error of \(\bar{X}\). By
analogy, a confidence interval for a parameter \(\theta\) based on a bootstrapped statistic \(\hat{\theta}\) could be obtained
by replacing \(\bar{x}\) in the one-sample t interval with the calculated value of \(\hat{\theta}\) from the sample and
replacing \(s/\sqrt{n}\) with the bootstrap standard error of \(\hat{\theta}\).


DEFINITION Suppose we wish to estimate a parameter \(\theta\) by the corresponding sample statistic \(\hat{\theta}\).
A bootstrap t confidence interval for \(\theta\) with confidence level 100(1 − α)% is

\[\hat{\theta} \pm t_{\alpha/2,n-1} \cdot s_{boot} \tag{8.22}\]

where the value of the statistic \(\hat{\theta}\) in (8.22) is obtained from the original sample.

The bootstrap t confidence interval is appropriate when the bootstrap distribution
of the statistic is approximately normal and the bias of the bootstrap distribution
is small.


Example 8.19 (Example 8.18 continued) Let’s construct an interval estimate for the parameter
\(\mu_{tr}\), the population 10% trimmed mean salary for all database administrators with 0–2 years of
experience. Since the bootstrap distribution of \(\bar{X}_{tr}\) in Figure 8.14b appears approximately normal, we
may reasonably apply the bootstrap t interval.

The 10% trimmed mean of the 115 salaries in the sample is \(\bar{x}_{tr}\) = $86,320. The bootstrap standard
error of \(\bar{X}_{tr}\) (that is, the sample standard deviation of the bootstrap values displayed in Figure 8.14b)
is \(s_{boot}\) = $2994 (the software that performed the bootstrapping provided this value). A 95%
confidence level requires \(t_{.025,114} = 1.981\), giving a CI of

86,320 ± 1.981(2994) = 86,320 ± 5931 = (80,389, 92,251)

We are 95% confident that the 10% trimmed mean for the salary distribution of this population of
DBAs is between $80,389 and $92,251. ■
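
As a quick check of the arithmetic in (8.22), the interval can be reproduced from the three numbers reported in Example 8.19. The helper name `bootstrap_t_interval` is hypothetical, and the critical value is typed in from the text rather than computed.

```python
def bootstrap_t_interval(theta_hat, s_boot, t_crit):
    """Bootstrap t interval (8.22): theta_hat +/- t_crit * s_boot."""
    margin = t_crit * s_boot
    return theta_hat - margin, theta_hat + margin

# Numbers reported in Example 8.19
lower, upper = bootstrap_t_interval(theta_hat=86320, s_boot=2994, t_crit=1.981)
print(f"({lower:,.0f}, {upper:,.0f})")
```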


If the bootstrap distribution of a statistic is not normal, this casts doubt on the normality of its
sampling distribution and suggests that a z- or t-based interval is not appropriate. Instead, we can use
percentiles of the bootstrap distribution itself to form an interval. After all, critical values such as
z = ±1.96 are used because they bound the “middle 95%” of a certain standardized distribution. Even
if a distribution is not symmetric, we can still identify the endpoints of the “middle 95%” of a
distribution.

DEFINITION Suppose we have the bootstrap distribution of a statistic that estimates a certain
parameter. A 95% confidence bootstrap percentile interval for that parameter
has endpoints equal to the 2.5th percentile and the 97.5th percentile of this
bootstrap distribution.

Similarly, a bootstrap percentile interval with confidence level 100(1 − α)% for
a parameter has endpoints equal to the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the bootstrap
distribution of the corresponding statistic.

Bootstrap percentile intervals are appropriate when the bias of the bootstrap
distribution is low.


The .025 and .975 quantiles of a bootstrap distribution must be estimated from the B bootstrap
resamples actually obtained. For B = 1000 resamples, one typically uses the 25th-smallest and
25th-largest values among \(\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_{1000}^*\); that is, the endpoints of the 95% confidence bootstrap
percentile interval are the 25th and 976th ordered \(\hat{\theta}^*\) values. A similar approach may be applied to other
values of B and other confidence levels.


Example 8.20 (Example 8.17 continued) Figure 8.13 shows the approximate bootstrap distribution
of \(\bar{X}\) based on B = 1000 bootstrap resamples. The distribution does not appear normal. The
25th-smallest and 25th-largest of the 1000 \(\bar{x}^*\) values are 16.56 and 20.58, respectively. Thus, with 95%
confidence, we estimate that the true mean tip at the restaurant where Erich worked is between
16.56% and 20.58%. ■
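
The percentile interval is easy to sketch in code using the 30 tip percentages from Example 8.17. The function name and the seed below are arbitrary choices, so the endpoints will be close to, but not identical to, the 16.56 and 20.58 reported above, which came from a different set of 1000 resamples.

```python
import random
import statistics

tips = [22.7, 16.3, 13.6, 16.8, 29.9, 15.9, 14.0, 15.0, 14.1, 18.1,
        22.8, 27.6, 16.4, 16.1, 19.0, 13.5, 18.9, 20.2, 19.7, 18.2,
        15.4, 15.7, 19.0, 11.5, 18.4, 16.0, 16.9, 12.0, 40.1, 19.2]

def percentile_interval(boot_values, conf=0.95):
    """Percentile interval taken from the ordered bootstrap values.

    With B = 1000 and conf = 0.95 this returns the 25th and 976th ordered values.
    """
    ordered = sorted(boot_values)
    B = len(ordered)
    k = max(1, round(B * (1 - conf) / 2))  # values set aside in each tail
    return ordered[k - 1], ordered[B - k]  # k-th smallest and k-th largest

rng = random.Random(2024)  # arbitrary seed, chosen only for reproducibility
boot_means = [statistics.mean(rng.choices(tips, k=len(tips))) for _ in range(1000)]
lower, upper = percentile_interval(boot_means)
print(f"95% percentile interval for the mean tip: ({lower:.2f}, {upper:.2f})")
```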


A Refined Interval 

Some caution must be taken when using a percentile interval. It is known that percentile intervals 
sometimes have lower confidence levels than advertised. When the bootstrap distribution is skewed, 
bias tends to be greater, and the percentile intervals are not equally likely to “miss” the value of a 
parameter on the high and low sides. A somewhat sophisticated adjustment to the traditional per- 
centile interval corrects for these problems: the bias-corrected and accelerated (BCa) interval is 
almost always superior to the basic percentile interval and should be used whenever software is 
available. BCa intervals are generally accurate unless the sample size is extremely small. 

The acceleration aspect of the BCa interval is an adjustment for dependence of the standard error
of the estimator on the parameter that is being estimated. For example, suppose we are trying to
estimate the mean in the case of exponential data. In this case the standard deviation is equal to the
mean, and the standard error of \(\bar{X}\) is \(\sigma/\sqrt{n} = \mu/\sqrt{n}\), so the standard error of the estimator \(\bar{X}\) depends
strongly on the parameter \(\mu\) that is being estimated. If the histogram in Figure 8.13 resembled the
exponential pdf, we would expect the BCa method to make a substantial correction to the percentile
interval.


Bootstrapping the Median 

The sample median \(\tilde{X}\) is less sensitive than \(\bar{X}\) to the influence of individual observations. For the 30
tip percentages in Example 8.17, the median is 16.85, substantially less than the mean of 18.43. The
mean is pulled upward by the few large values, but these extremes have little effect on the median.
Unfortunately, it is more difficult to get confidence intervals for the population median than for the
mean, in part because we can easily estimate the standard error of a sample mean (\(s/\sqrt{n}\)) but no
analogous formula exists for the sample median.

Example 8.21 (Example 8.17 continued) Let’s use the bootstrap method to get a confidence interval
for the true median tip percentage, \(\tilde{\mu}\). As before, 1000 resamples of the original 30 observations are
taken with replacement, but now for each resample the sample median \(\tilde{x}^*\) is calculated. A histogram
of the bootstrap medians \(\tilde{x}_1^*, \ldots, \tilde{x}_{1000}^*\) is shown in Figure 8.15.

Figure 8.15 Histogram of the bootstrap medians for Example 8.21


It should be apparent that the distribution of the 1000 bootstrap medians is far from normal. As is 
often the case with the median, the bootstrap distribution takes on just a few values and there are 
many repeats. Instead of 1000 different values, as would be expected if we took 1000 samples from a 
true continuous distribution, here there are only a handful of distinct values. 

Because the bootstrap distribution is so nonnormal, we should use the percentile interval in which 
the confidence limits for a 95% CI are taken from the 2.5 and 97.5 percentiles of the bootstrap 
distribution. When the 1000 bootstrap medians displayed in Figure 8.15 are sorted, the 25th value is 
15.95 and the 976th value is 18.90, so the 95% confidence interval for the population median is
(15.95, 18.90). ■
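
The same resampling can be sketched for the median of the tip data. The seed is an arbitrary assumption, so the endpoints need not match (15.95, 18.90) exactly; the count of distinct values is printed to show how coarse the bootstrap distribution of a median is compared with 1000 draws from a continuous distribution.

```python
import random
import statistics

tips = [22.7, 16.3, 13.6, 16.8, 29.9, 15.9, 14.0, 15.0, 14.1, 18.1,
        22.8, 27.6, 16.4, 16.1, 19.0, 13.5, 18.9, 20.2, 19.7, 18.2,
        15.4, 15.7, 19.0, 11.5, 18.4, 16.0, 16.9, 12.0, 40.1, 19.2]

rng = random.Random(7)  # arbitrary seed, chosen only for reproducibility
# Median of each of 1000 resamples (size 30, with replacement), sorted ascending
boot_medians = sorted(
    statistics.median(rng.choices(tips, k=len(tips))) for _ in range(1000)
)
lower, upper = boot_medians[24], boot_medians[975]  # 25th and 976th ordered values
distinct = len(set(boot_medians))

print(f"sample median: {statistics.median(tips):.2f}")
print(f"95% percentile interval for the median: ({lower:.2f}, {upper:.2f})")
print(f"distinct values among 1000 bootstrap medians: {distinct}")
```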


We should be a bit uncomfortable with the results of bootstrapping the median. Given that the 
bootstrap distribution takes on just a few values but the true sampling distribution is continuous, we 
should worry a little about how well the bootstrap distribution approximates the true sampling 
distribution. On the other hand, the situation here is nowhere near as bad as it could be. Sometimes, 
especially when the sample size is smaller, the bootstrap distribution has far fewer values. 

Exercise 88 presents an alternative method for constructing a confidence interval for a population 
median that can be applied to data from any continuous distribution, irrespective of the sample size. 


Further Comments on Bootstrapping 

Is the bootstrap guaranteed to work, or is it possible that the method can give grossly incorrect 
estimates? The key here is how closely the original sample represents the whole distribution of the 
random variable X. When the sample is small, then there is a possibility that important features of the 
distribution are not included in the data set. In Example 8.17 the value 40.1% is highly influential. If 
we drew another sample of 30 observations independent of this sample, the luck of the draw might 
give no values above 25, and the sample would yield very different conclusions. The bootstrap is a
useful method for making inferences from data, but it is dependent on a good sample. If this is all the
data that we can get, we will never know how well our sample represents the distribution, and 
therefore how good our answer is. Of course, no statistical method will give good answers if the 
sample is not representative of the population. 


Exercises: Section 8.5 (63-70) 


63. In a survey, students gave their study time per week (h), and here are the 22 values:

15.0 10.0 10.0 15.0 25.0 7.0 30 80 10.0
10.0 11.0 7.0 5.0 150 7.5 7.5 12.0 7.0
pee) ae Ee ee

A 95% confidence interval for the population mean \(\mu\) is desired.

a. Compute the t-based confidence interval of Section 8.2.
b. Create a normal probability plot. Is it apparent that the data set is not normal,
so the t-based interval is of questionable validity?
c. Use software to generate a bootstrap sample of means. Create a histogram of
the resulting \(\bar{x}^*\) values.
d. Use the standard deviation for part (c) to get a 95% bootstrap t confidence interval
for \(\mu\). Based on the histogram in part (c), is this CI valid?
e. Use part (c) to form the 95% confidence bootstrap percentile interval for \(\mu\).
f. Which interval should be used, and why?

64. Consider obtaining a 95% confidence interval for the population median \(\tilde{\mu}\) of the
study hours data in the previous exercise.

a. Use software to generate a bootstrap sample of medians.
b. Use the standard deviation for part (a) to get a 95% bootstrap t confidence interval
for \(\tilde{\mu}\).
c. Investigate the distribution of the bootstrap medians and discuss the validity of
part (b).
d. Use the results of part (a) to form a 95% confidence bootstrap percentile interval
for \(\tilde{\mu}\).
e. For the study hours data, state your preference between the median and the
mean, and explain your reasoning.

65. Here are 68 weight gains (lb) for pregnant women from conception to delivery
(“Classifying Data Displays with an Assessment of Displays Found in Popular
Software,” Teach. Statist., Autumn 2002: 96-101). A 95% CI for the population
mean weight gain \(\mu\) is desired.

25 14 20 38 21 22 36 38 35 37
35 24 31 28 25 32 23 30 39 26
38 «20: «21 «ls35 42 BS 523
43 38 21 76 22 2 10 19 25 25
2 31 34 36 35° 330 24 44 3543
7 32 25 #27 «#31 «14 «42250 «1606 «(2547
35-14 65 40 35 45 #27. «224

a. Compute the t-based confidence interval of Section 8.2.
b. Check for normality to see if part (a) is valid. Is the sample large enough that the
interval might be valid anyway?
c. Use software to generate a bootstrap sample of means. Create a histogram of
the resulting \(\bar{x}^*\) values.
d. Use the standard deviation for part (c) to get a 95% bootstrap t confidence interval
for \(\mu\). Based on the histogram in part (c), is this CI valid?
e. Use part (c) to form the 95% confidence bootstrap percentile interval for \(\mu\).
f. Compare all three intervals. [Note: If they are all close, then the bootstrap
supports the CI of part (a).]

66. Consider again the weight gain data from the previous exercise.

a. Use the method of Section 8.4 to obtain a 95% confidence interval for \(\sigma\). Discuss

67. 
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normality for the weight gain data: do 
you have reason to be concerned about 
the validity of this CI? 

b. Use software to generate a bootstrap 
distribution of standard deviations. 
(That is, generate many resamples from 
the given data, and for each one com- 
pute the sample standard deviation s;.) 

c. Use the bootstrap standard deviation for 
part (a) to get a 95% bootstrap ¢ confi- 
dence interval for o. 

d. Investigate the distribution of the boot- 
strap standard deviations and discuss the 
validity of part (c). 

e. Use part (b) to form the 95% confidence 
bootstrap percentile interval for oc. 


Nine Australian soldiers were subjected to 
extreme conditions, which involved a 100- 
min walk with a 25-lb pack when the 
temperature was 40 °C (104 °F). One of 
them overheated (above 39 °C) and was 
removed from the study. Here are the rectal 
Celsius temperatures of the other eight at 
the end of the walk (“Neural Network 
Training on Human Body Core Tempera- 
ture Data,” Combatant Protection and 
Nutrition Branch, Aeronautical and Mar- 
itime Research Laboratory of Australia, 
DSTO TN-0241, 1999): 


38.4 38.7 39.0 385 385 39.0 38.5 38.6 

a. Compute the t-based confidence interval 
of Section 8.2 for the population mean yj. 

b. Check for the validity of part (a). 

c. Use software to generate a bootstrap 
sample of means. Create a histogram of 
the resulting x* values. 

d. Use the standard deviation for part (c) to 
get a 95% bootstrap ¢ confidence inter- 
val for yz. Based on the histogram in part 
(c), is this CI valid? 

e. Use part (c) to form the 95% confidence 
bootstrap percentile interval for 1. 

f. Compare the intervals and explain your 


preference. 


g. Based on your knowledge of normal body temperature, would you say that
body temperature can be influenced by environment?

68. Refer back to the body temperature data in the previous exercise.

a. Obtain a bootstrap sample of 12.5% trimmed means. [Hint: With n = 8, a
12.5% trimmed mean entails deleting the largest and smallest value in each
resample.]
b. Use the standard deviation from the bootstrap samples in part (a) to get a
95% bootstrap t confidence interval for the population 12.5% trimmed mean μ_tr.
c. Investigate the distribution of the bootstrap trimmed means and discuss the
validity of the interval in part (b).
d. Use the results of part (a) to form a 95% confidence bootstrap percentile interval
for μ_tr.
e. Compare all the intervals for the mean and trimmed mean μ_tr. Are they fairly
similar? How do you explain that?

69. If you go to a major league baseball game, how long do you expect the game to be?
From the 2430 games played in 2018, here is a random sample of 25 times (min):

168 187 161 205 162 183 186 190 136
177 182 185 185 194 169 151 192 181
194 162 194 171 172 168 174

This is one of those rare instances in which we can calculate a confidence interval and
compare with the actual population mean. The mean duration of all 2430 games was
μ = 184.94 min (a little more than 3 h), but pretend we don’t know that.

a. Compute the t-based confidence interval of Section 8.2.
b. Use a normal probability plot to see if part (a) is valid.
c. Use software to generate a bootstrap sample of means.
d. Use the standard deviation for part (c) to get a 95% bootstrap t confidence interval
for μ.
e. Use part (c) to form the 95% confidence bootstrap percentile interval for μ.
f. Say which interval should be used and explain why. Does your interval include
the true value, μ = 184.94 min?

70. Because of extra-inning games, the median might be a more meaningful statistic for the
length-of-game data in the previous exercise. The median length of all 2430 MLB
games in 2018 was μ̃ = 182 min.

a. Use software and the data in the previous exercise to obtain a bootstrap sample
of medians.
b. Obtain a 95% confidence bootstrap t interval for the population median.
c. Investigate the distribution of the bootstrap medians and discuss the validity of
part (b).
d. Determine a 95% confidence bootstrap percentile interval for the median.
Compare your answer with the population median.


Supplementary Exercises (71-92) 


71. A manufacturer of college textbooks is interested in estimating the strength of the
bindings produced by a particular binding machine. Strength can be measured by
recording the force required to pull the pages from the binding. If this force is
measured in pounds, how many books should be tested to estimate the average
force required to break the binding to within .1 lb with 95% confidence? Assume
that σ is known to be .8.

72. According to the article “Fatigue Testing of Condoms” (Polymer Testing 2009:
567-571), “tests currently used for condoms are surrogates for the challenges they
face in use,” including a test for holes, an inflation test, a package seal test, and tests of
dimensions and lubricant quality (all fertile territory for the use of statistical
methodology!). The investigators developed a new test that adds cyclic strain to a level
well below breakage and determines the number of cycles to break. A sample of 20
condoms of one particular type resulted in a sample mean number of 1584 and a sample
standard deviation of 607. Calculate and interpret a confidence interval at the 99%
confidence level for the true average number of cycles to break. [Note: The article
presented the results of hypothesis tests based on the t distribution; the validity of
these depends on assuming normal population distributions.]

73. Before opening a new location, franchise companies conduct market research to
determine if sufficient demand exists for their products. A national sandwich chain
recently conducted a survey to investigate opening a franchise in a particular town.
Among 300 households contacted through random-digit dialing, 198 respondents
indicated they would patronize this shop.

a. Let p = the proportion of all households in this town that would patronize the
sandwich franchise. Calculate and interpret a 95% lower confidence bound for p.
b. From years of marketing experience, the company knows they need more than
5000 households in the population to patronize the shop; this accounts for
competing local businesses and variation in frequency of visitation by
potential patrons. This particular town has 7700 households. Determine a 95%
lower confidence bound for the number of households that will eat at the new
store. Can the company be confident they will have enough customers?
c. Imagine the company ignored sampling variability and simply used the sample
proportion from the survey to determine the expected number of customers
(rather than the lower confidence bound). Would that change their opinion
regarding the viability of the new location? Explain.


74. The Pew Forum on Religion and Public Life reported on Dec. 9, 2009 that in a
survey of 2003 American adults, 25% said they believed in astrology.

a. Calculate and interpret a confidence interval at the 99% confidence level for
the proportion of all adult Americans who believe in astrology.
b. What sample size would be required for the width of a 99% CI to be at most .05
irrespective of the value of p?
c. The upper limit of the CI in (a) gives an upper confidence bound for the
proportion being estimated. What is the corresponding confidence level?

75. There were 12 first-round heats in the men’s 100-m race at the 1996 Atlanta
Summer Olympics. Here are the reaction times in seconds (time to first movement) of
the top four finishers of each heat. The first 12 are the 12 winners, then the
second-place finishers, and so on.

1st   .187 .152 .137 .175 .172 .165
      .184 .185 .147 .189 .172 .156
2nd   .168 .140 .214 .163 .202 .173
      .175 .154 .160 .169 .148 .144
3rd   .159 .145 .187 .222 .190 .158
      .202 .162 .156 .141 .167 .155
4th   .156 .164 .160 .145 .163 .170
      .182 .187 .148 .183 .162 .186

Because reaction time has little if any relationship to the order of finish, it is
reasonable to view the times as coming from a single population.

a. Estimate the population mean in a way that conveys information about
precision and reliability. [Note: Σx_i = 8.08100, Σx_i² = 1.37813.]
b. Calculate a 95% confidence interval for the population proportion of reaction
times that are below .15. (Reaction times below .10 are regarded as false
starts, meaning that the runner anticipates the starter’s gun, because such
times are considered physically impossible. Linford Christie, who had a
reaction time of .160 in placing second in his first-round heat, had two such
false starts in the finals and was disqualified.)


76. Aphid infestation of fruit trees can be controlled either by spraying with pesticide or
by inundation with ladybugs. In a particular area, four different groves of fruit trees are
selected for experimentation. The first three groves are sprayed with pesticides 1, 2, and
3, respectively, and the fourth is treated with ladybugs, with the following results
on yield:

Treatment   n_i (number of trees)   x̄_i (bushels/tree)   s_i
1           100                     10.5                  1.5
2           90                      10.0                  1.3
3           100                     10.1                  1.8
4           120                     10.7                  1.6

Let μ_i = the true average yield (bushels/tree) after receiving the ith treatment. Then

\[ \theta = \frac{1}{3}(\mu_1 + \mu_2 + \mu_3) - \mu_4 \]

measures the difference in true average yields between treatment with pesticides
and treatment with ladybugs. When n_1, n_2, n_3, and n_4 are all large, the estimator θ̂
obtained by replacing each μ_i by X̄_i is approximately normal. Use this to derive a
large-sample 100(1 − α)% CI for θ, and compute the 95% interval for the given data.

77. It is important that face masks used by firefighters be able to withstand high
temperatures because firefighters commonly work in temperatures of 200-500 °F. In a
test of one type of mask, 11 of 55 masks had lenses pop out at 250°. Construct a 90% CI
for the true proportion of masks of this type whose lenses would pop out at 250°.

78. A journal article reports that a sample of size 5 was used as a basis for calculating a
95% CI for the true average natural frequency (Hz) of delaminated beams of a
certain type. The resulting interval was (229.764, 233.504). You decide that a
confidence level of 99% is more appropriate than the 95% level used. What are the
limits of the 99% interval? [Hint: Use the center of the interval and its width to
determine x̄ and s.]

79. The article “The Association Between Television Viewing and Irregular Sleep
Schedules Among Children Less Than 3 Years of Age” (Pediatrics 2005: 851-856)
reported the following 95% confidence intervals for average TV viewing time
(hours per day) for three different age groups.

0-11 months old   12-23 months old   24-35 months old
(0.8, 1.0)        (1.4, 1.8)         (2.1, 2.5)

a. Interpret each of these three intervals.
b. The three intervals are not the same width. What might explain this?
c. Do the intervals suggest a relationship between age and TV viewing time
among children of this age range? Explain.

80. In Example 7.12, we introduced the concept of a censored experiment in which
n components are put on test and the experiment terminates as soon as r of the
components have failed. Suppose component lifetimes are independent, each having
an exponential distribution with parameter λ. Let Y_1 denote the time at which the first
failure occurs, Y_2 the time at which the second failure occurs, and so on, so that

\[ T_r = Y_1 + \cdots + Y_r + (n - r)Y_r \]

is the total accumulated lifetime at termination. Then it can be shown that 2λT_r has a
chi-squared distribution with 2r df. Use this fact to develop a 100(1 − α)% CI formula
for true average lifetime 1/λ. Compute a 95% CI from the data in Example 7.12.

81. Exercises 77-78 from Chapter 7 introduced “regression through the origin” to relate a
dependent variable y to an independent variable x. The assumption there was that
for any fixed x value, the dependent variable is a random variable Y with mean
value βx and variance σ² (so that Y has mean value zero when x = 0). The data
consists of n independent (x_i, Y_i) pairs, where each Y_i is normally distributed with
mean βx_i and variance σ². The likelihood is then a product of normal pdfs with different
mean values but the same variance.

a. Show that the mle of β is

\[ \hat{\beta} = \frac{\sum x_i Y_i}{\sum x_i^2} \]

b. Verify that the mle of (a) is unbiased.
c. Obtain an expression for V(β̂) and then for σ_β̂.
d. For purposes of obtaining a precise estimate of β, is it better to have the x_i’s
all close to 0 (the origin) or quite far from 0? Explain your reasoning.
e. The natural prediction of Y_i is β̂x_i. Let

\[ S^2 = \frac{\sum (Y_i - \hat{\beta} x_i)^2}{n - 1} \]

which is analogous to sample variance. It can be shown that

\[ T = \frac{\hat{\beta} - \beta}{S / \sqrt{\sum x_i^2}} \]

has a t distribution with n − 1 df. Use this to obtain a CI formula for estimating β,
and calculate a 95% CI using the data from the cited exercises.

82. Let X_1, ..., X_n be a random sample from a uniform distribution on the interval [0, θ]
and Y = max(X_1, ..., X_n). Then methods from Section 5.7 can be used to show that
the rv U = Y/θ has pdf

\[ f_U(u) = n u^{n-1}, \quad 0 \le u \le 1 \]

a. Verify that

\[ P\left[ (\alpha/2)^{1/n} \le \frac{Y}{\theta} \le (1 - \alpha/2)^{1/n} \right] = 1 - \alpha \]

and use this to derive a 100(1 − α)% CI for θ.
b. Verify that

\[ P\left( \alpha^{1/n} \le \frac{Y}{\theta} \le 1 \right) = 1 - \alpha \]

and derive a 100(1 − α)% CI for θ based on this probability statement.
c. Which of the two intervals derived in (a) and (b) is shorter? If your waiting time
for a morning bus is uniformly distributed and observed waiting times are
x_1 = 4.2, x_2 = 3.5, x_3 = 1.7, x_4 = 1.2, and x_5 = 2.4, obtain a 95% CI for θ by
using the shorter of the two intervals.


83. Let 0 ≤ γ ≤ α. Then a 100(1 − α)% CI for μ when n is large is

\[ \left( \bar{x} - z_{\gamma} \cdot \frac{s}{\sqrt{n}}, \; \bar{x} + z_{\alpha - \gamma} \cdot \frac{s}{\sqrt{n}} \right) \]

The choice γ = α/2 yields the large-sample interval derived in Section 8.2; if γ ≠ α/2,
this confidence interval is not symmetric about x̄. The width of the interval is

\[ w = \frac{s (z_{\gamma} + z_{\alpha - \gamma})}{\sqrt{n}} \]

Show that w is minimized for the choice γ = α/2, so that the symmetric interval is
the shortest. [Hints: (1) By definition of z_α, Φ(z_α) = 1 − α, so that z_α = Φ⁻¹(1 − α);
(2) the relationship between the derivative of a function y = f(x) and the inverse
function x = f⁻¹(y) is (d/dy) f⁻¹(y) = 1/f′(x).]


84. Suppose x_1, x_2, ..., x_n are observed values resulting from a random sample from a
symmetric but possibly heavy-tailed distribution. Chapter 11 of Understanding
Robust and Exploratory Data Analysis (see the bibliography) suggests the following
robust 95% CI for the population mean (point of symmetry):

\[ \tilde{x} \pm \left( \frac{\text{conservative } t \text{ critical value}}{1.075} \right) \cdot \frac{\text{iqr}}{\sqrt{n}} \]

The value of the quantity in parentheses is 2.10 for n = 10, 1.94 for n = 20, and 1.91
for n = 30. Compute this CI for the restaurant tip data of Example 8.17, and
compare to the t CI appropriate for a normal population distribution.


85. a. Use the results of Example 8.5 to obtain a 95% lower confidence bound for the
parameter λ of an exponential distribution, and calculate the bound
based on the data given in the example.
b. If lifetime X has an exponential distribution, the probability that lifetime
exceeds t is given by P(X > t) = e^(−λt). Use the result of part (a) to obtain a 95%
lower confidence bound for the probability that lifetime exceeds 100 min.

86. Let θ_1 and θ_2 denote the mean weights for animals of two different species. A
biologist wishes to estimate the ratio θ_1/θ_2. Unfortunately the species are extremely
rare, so the estimate will be based on finding a single animal of each species. Let
X_i denote the weight of the species i animal (i = 1, 2), assumed to be normally
distributed with mean θ_i and standard deviation 1.

a. Show that the rv

\[ h(X_1, X_2; \theta_1, \theta_2) = \frac{\theta_2 X_1 - \theta_1 X_2}{\sqrt{\theta_1^2 + \theta_2^2}} \]

is a pivotal quantity by determining the distribution of h.
b. Show that h depends on θ_1 and θ_2 only through θ_1/θ_2. [Hint: Divide numerator
and denominator by θ_2.]
c. Consider Expression (8.7) from the first section of this chapter with a = −1.96
and b = 1.96. Now replace < by = and solve for θ_1/θ_2. Then show that a
confidence interval results if x_1² + x_2² ≥ 1.96², whereas if this inequality is not
satisfied, the resulting confidence set is the complement of an interval.

87. The one-sample CI for a normal mean and PI for a single observation from a normal
distribution were both based on the central t distribution. A CI for a particular
percentile (e.g., the 1st percentile or the 95th percentile) of a normal population
distribution is based on the noncentral t distribution. A particular distribution of this type
is specified by both df and the value of the noncentrality parameter δ (δ = 0 gives the
central t distribution). The key result is that the variable

\[ T = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} - (z \text{ percentile})\sqrt{n}}{S/\sigma} \]

has a noncentral t distribution with df = n − 1 and δ = −(z percentile)√n.

Let t_{.025,ν,δ} and t_{.975,ν,δ} denote the critical values that capture upper-tail area .025 and
lower-tail area .025, respectively, under the noncentral t curve with ν df and
noncentrality parameter δ (when δ = 0, t_{.975} = −t_{.025}, since central t distributions
are symmetric about 0).

a. Use the given information to obtain a formula for a 95% confidence interval
for the (100p)th percentile of a normal population distribution.
b. For δ = 6.58 and df = 15, t_{.975} and t_{.025} are (from software) 4.1690 and 10.9684,
respectively. Use this information to obtain a 95% CI for the 5th percentile of
the beer alcohol distribution considered in Exercise 17.


88. In this exercise, we develop a CI for μ̃ that is valid whatever the shape of the
population distribution as long as it is continuous. Let X_1, ..., X_n be a random sample
from the distribution and Y_1 < ··· < Y_n denote the corresponding ordered values
(smallest observation, second smallest, and so on).

a. What is P(X_1 < μ̃)? What is P({X_1 < μ̃} ∩ {X_2 < μ̃})?
b. What is P(Y_n < μ̃)? What is P(Y_1 > μ̃)? [Hint: What condition involving all of
the X_i’s is equivalent to the largest being smaller than the population median?]
c. What is P(Y_1 < μ̃ < Y_n)? What does this imply about the confidence level
associated with the CI (y_1, y_n) for μ̃?
d. An experiment carried out to study the time (min) necessary for an anesthetic to
produce the desired result yielded the following data: 31.2, 36.0, 31.5, 28.7,
37.2, 35.4, 33.3, 39.3, 42.0, 29.9. Determine the confidence interval of
(c) and the associated confidence level.

89. Consider the situation described in the previous exercise.

a. What is P({X_1 < μ̃} ∩ {X_2 > μ̃} ∩ ··· ∩ {X_n > μ̃}), that is, the probability
that only the first observation is smaller than the median?
b. What is the probability that exactly one of the n original observations is smaller
than the median?
c. What is P(μ̃ < Y_2)? [Hint: The event in parentheses occurs if all n of the
observations exceed the median. How else can it occur?]
d. What is P(Y_2 < μ̃ < Y_{n−1})? What does this imply about the confidence level
associated with the CI (y_2, y_{n−1}) for μ̃?
e. Determine the confidence level and CI using part (d) with the data given in the
previous exercise.


90. The previous two exercises considered a CI for a population median μ̃ based on the
ordered values from a random sample. Let’s now consider a prediction interval for
the next observation X_{n+1}, which is assumed to be independent of X_1, ..., X_n.

a. What is P(X_{n+1} < X_1)? What is P({X_{n+1} < X_1} ∩ {X_{n+1} < X_2})?
b. What is P(X_{n+1} < Y_1)? What is P(X_{n+1} > Y_n)?
c. What is P(Y_1 < X_{n+1} < Y_n)? What does this say about the prediction level for
the PI (y_1, y_n)? Determine the prediction level and interval for the data in the
previous two exercises.


91. Consider 95% CIs for two different parameters θ_1 and θ_2, and let A_i (i = 1, 2)
denote the event that the value of θ_i is included in the random interval that results
in the CI. Thus P(A_i) = .95.

a. Suppose that the data on which the CI for θ_1 is based is independent of the data
used to obtain the CI for θ_2 (e.g., we might have θ_1 = μ, the population mean
height for American females, and θ_2 = p, the proportion of all iPhones
that don’t need warranty service). What can be said about the simultaneous
confidence level for the two intervals? That is, how confident can we be that
the first interval contains the value of θ_1 and that the second contains the value
of θ_2? [Hint: Consider P(A_1 ∩ A_2).]
b. Now suppose the data for the first CI is not independent of that for the second
one. What now can be said about the simultaneous confidence level for both
intervals? [Hint: Consider P(A_1′ ∪ A_2′), the probability that at least one interval
fails to include the value of what it is estimating. Now use the fact that
P(A_1′ ∪ A_2′) ≤ P(A_1′) + P(A_2′). The generalization of the bound on P(A_1′ ∪ A_2′)
to the probability of a k-fold union is one version of the Bonferroni inequality.]
c. What can be said about the simultaneous confidence level if the confidence
level for each interval separately is 100(1 − α)%? What can be said about
the simultaneous confidence level if a 100(1 − α)% CI is computed separately
for each of k parameters θ_1, ..., θ_k?


92. The Bonett CI for a population variance σ² mentioned at the end of Section 8.4, unlike
the chi-squared method, does not hinge on population normality. This interval
involves a transformation along with an estimate of the kurtosis of the underlying
distribution, a measure of its “tail” behavior. Specifically, Bonett defines a kurtosis
estimate by

\[ \hat{\gamma}_4 = \frac{n \sum (X_i - \bar{X}_{tr})^4}{\left( \sum (X_i - \bar{X})^2 \right)^2} \]

where X̄_tr is the trimmed mean with trim proportion 1/[2√(n − 4)]. Then the
Bonett CI for σ² with confidence level 100(1 − α)% has endpoints

\[ \exp\left( \ln(cS^2) \pm z_{\alpha/2} \, c \sqrt{\frac{\hat{\gamma}_4 - (n - 3)/n}{n - 1}} \right) \]

where c = n/(n − z_{α/2}) is “an empirically determined, small-sample adjustment”
(meaning Bonett found this value by trial and error).

a. For the study hours data in Exercise 63, n = 22, s = 4.603 and γ̂_4 = 7.003. Use
Bonett’s formula to calculate a 95% CI for the population variance σ².
b. Use part (a) to determine a 95% CI for σ.
c. Show that as n → ∞, both endpoints of the Bonett CI converge to σ². [Hint: The
kurtosis estimate γ̂_4 converges to a constant, while S² → σ².]




Introduction 

A parameter can be estimated from sample data either by a single number (a point estimate) or an 
entire interval of plausible values (a confidence interval). Frequently, however, the objective of an 
investigation is not to estimate a parameter but to decide which of two contradictory claims about the 
parameter is correct. Methods for accomplishing this comprise the part of statistical inference called 
hypothesis testing. In this chapter, we first discuss some of the basic concepts and terminology in 
hypothesis testing and then develop decision procedures for the most frequently encountered testing 
problems based on a sample from a single population. 


9.1 Hypotheses and Test Procedures 


A statistical hypothesis, or just hypothesis, is a claim or assertion either about the value of a single 
parameter (i.e., a characteristic of a population or a probability distribution), about the values of 
several parameters, or about the form of an entire probability distribution. Examples include 


• The claim μ = $311, where μ is the true average one-term textbook expenditure for students at a
university

• The statement p < .50, where p is the proportion of adults who approve of the job that the
President is doing

• The assertion that μ1 − μ2 > 5, where μ1 and μ2 denote the true average decreases in systolic
blood pressure for two different drugs

• The claim that stopping distance for a car under particular conditions has a normal distribution.


Hypotheses of the last sort will be considered briefly in Chapter 13. In this and the next several 
chapters, we concentrate on hypotheses about parameters. 

In any hypothesis-testing problem, there are two contradictory hypotheses under consideration.
One hypothesis might be the claim μ = $311 and the other μ ≠ $311, or the two contradictory
statements might be p ≥ .50 and p < .50. The objective is to decide, based on sample information,
which of the two hypotheses is correct. In statistics, hypothesis-testing problems are formulated so
that one of the claims is initially assumed to be true. This initial claim will not be rejected in favor of
the alternative claim unless the sample provides strong evidence for the latter.

DEFINITION The null hypothesis, denoted by H0, is the claim that is initially assumed to be true
(the “prior belief” claim). The alternative hypothesis, denoted by Ha, is the
assertion that is contradictory to H0.


The null hypothesis will be rejected in favor of the alternative hypothesis only if sample evidence
suggests that H0 is false. If the sample does not strongly contradict H0, we will continue to believe in
the plausibility of the null hypothesis. The two possible conclusions from a hypothesis-testing
analysis are then reject H0 or fail to reject H0.

A test of hypotheses is any method for using sample data to decide whether the null hypothesis
should be rejected. Thus if we test H0: μ = $311 against the alternative Ha: μ ≠ $311, the null
hypothesis should be rejected only if sample data strongly suggests that μ is something other than
$311. In the absence of strong evidence, H0 should not be rejected since it is still judged to be
plausible.

There is a familiar analogy to this in a criminal trial. One claim is the assertion that the defendant is 
innocent. In the U.S. judicial system, this is the claim that is initially believed to be true. Only in the 
face of strong evidence to the contrary should the jury reject this claim in favor of the alternative 
assertion that the accused is guilty. In this sense, the claim of innocence is the favored or protected 
hypothesis, and the burden of proof is placed on those who believe in the alternative claim. 


Formulating Hypotheses 

Sometimes an investigator does not want to accept a particular assertion unless and until data can
provide strong support for the assertion. In that situation, this assertion will be the investigator’s
alternative hypothesis Ha. (Examples will be given shortly.) Scientific research often involves trying
to decide whether a current theory should be replaced by a more plausible and satisfactory
explanation of the phenomenon under investigation. A conservative approach is to identify the current
theory with H0 and the researcher’s alternative explanation with Ha. Rejection of the current theory
will then occur only when evidence is much more consistent with the new theory. In many situations,
Ha is referred to as the “research hypothesis,” since it is the claim that the researcher would really like
to validate. The word null means “of no value, effect, or consequence,” which suggests that H0 should
be identified with the hypothesis of no change (from current opinion), no difference, no improvement,
and so on.


Example 9.1 Many have heard the claim that college students gain an average of 15 lb during their
first year, but is this popular legend rooted in reality? This was the subject of the article “The Effects
of College on Weight: Examining the ‘Freshman 15’ Myth and Other Effects of College Over the Life
Cycle” (Demography 2017: 311-336). Let μ denote the true average weight gain of students over the
course of their first year in college. We initially give the “freshman 15” story the benefit of the doubt,
so that our null hypothesis is H0: μ = 15. It would be noteworthy if 15 were an underestimate or an
overestimate, suggesting that the alternative hypothesis should be Ha: μ ≠ 15. ■


Example 9.2 Vintners and wine consumers continue to debate whether to seal wine bottles with
corks or screwtops. Screwtops can reduce spoilage, but many wine enthusiasts associate them with
cheap or otherwise undesirable wines. An often-cited 2011 survey by Rebecca Bleibaum (Tragon
Corp.) found that about 20% of wine consumers would not buy screwtop wine, but negative attitudes
toward screwtops have abated over time. Let p denote the proportion of wine consumers today who
refuse to purchase screwtop wine. The null hypothesis is that no change has occurred since that
previous survey, H0: p = .2. A winery that was considering switching one of its wines from cork to
screwtop bottling would naturally be interested in the alternative hypothesis that this proportion has
decreased, Ha: p < .2. ■


Example 9.3 Consumer Reports (Nov. 26, 2018) reported that some American automakers are
reducing sedan production in response to the increasing popularity of trucks and SUVs, despite the
fact that sedans typically get better gas mileage. Extensive experience with engines for a certain type
of light-duty truck indicates that highway fuel efficiency (miles per gallon) is normally distributed
with a mean value of 25 and a standard deviation of 3. The manufacturer is considering a modification
to increase average fuel efficiency. Let μ denote true average efficiency for the new, modified engines.
The appropriate null (no-improvement) hypothesis is H0: μ = 25. The alternative hypothesis asserts
that there has, in fact, been an improvement: Ha: μ > 25.

Sample data will be collected from modified engines. Because of the expense of changing the
manufacturing process, the new engine design will only be adopted if the data provides convincing
evidence that μ really is greater than 25 mpg. ■


In our treatment of hypothesis testing, H0 will generally be stated as an equality claim. If θ denotes
the parameter of interest, the null hypothesis will have the form H0: θ = θ0, where θ0 is a specified
number called the null value of the parameter (i.e., the value claimed for θ by the null hypothesis).
For instance, consider the truck gas mileage situation of Example 9.3. The alternative hypothesis is
Ha: μ > 25, the claim that the mean fuel efficiency is improved by the engine modification. The null
hypothesis was stated as H0: μ = 25, so the null value of the parameter is μ0 = 25. But it would be
more mathematically natural to write H0: μ ≤ 25, according to which the new engine either is no
better or is worse than the one currently used. The rationale for using a simplified null hypothesis is
that any reasonable procedure for deciding between H0: μ = 25 and Ha: μ > 25 will also be
reasonable for deciding between the claim that μ ≤ 25 and Ha, and should lead to exactly the same
conclusion for any particular sample. The use of a simplified H0 is preferred because it has certain
technical benefits, which will become apparent shortly.

The alternative to the null hypothesis H0: θ = θ0 will look like one of the following three
assertions:

1. Ha: θ > θ0 (in which case the implicit null hypothesis is θ ≤ θ0)
2. Ha: θ < θ0 (so the implicit null hypothesis states that θ ≥ θ0)
3. Ha: θ ≠ θ0


Test Procedures 

A test procedure is a rule, based on sample data, for deciding whether to reject H0. A test of H0: p = .2
versus Ha: p < .2 in Example 9.2 might be based on surveying a random sample of n = 200 current
wine consumers. Let X denote the number of people in the sample who refuse to buy screwtop wine, a
binomial random variable (at least approximately); let x represent the observed value of X. If H0 is
true, E(X) = np = 200(.2) = 40, whereas we can expect fewer than 40 refusers if Ha is true. An
x value just a bit below 40 does not strongly contradict H0, so it is reasonable to reject H0 in favor of
Ha only if x is substantially less than 40. One such test procedure is to reject H0 if x ≤ 35 and not
reject H0 otherwise. This procedure has two elements: (1) a test statistic, or function of the sample
data, used to make a decision; and (2) a rejection region consisting of those test statistic values for
which H0 will be rejected in favor of Ha. In the wine scenario, X is the test statistic and the rejection
region consists of x = 0, 1, 2, ..., 35; H0 will not be rejected if x = 36, 37, ..., 199, or 200.

DEFINITION A test procedure is specified by the following:
1. A test statistic, a function of the sample data on which the decision
(reject H0 or do not reject H0) is to be based
2. A rejection region, the set of all test statistic values for which H0 will
be rejected
The null hypothesis will then be rejected if and only if the observed or
computed test statistic value falls in the rejection region.


In the context of Example 9.3, let X̄ denote the sample average highway fuel efficiency of a random
sample of 10 trucks with the new, modified engine. If H0 is true, E(X̄) = μ = 25, whereas if H0 is
false, we expect X̄ to exceed 25. Strong evidence against H0 is provided by a value x̄ that considerably
exceeds 25. Thus we might use X̄ as a test statistic along with the rejection region x̄ ≥ 30.

In both the wine and truck examples, the choices of the test statistic and the form of the rejection
region make sense intuitively. However, the choice of cutoff value used to specify the rejection region
was somewhat arbitrary. Instead of rejecting H0: p = .2 in favor of Ha: p < .2 when x ≤ 35, we
could use the rejection region x ≤ 30. For this region, H0 would not be rejected if 33 respondents
refused to buy screwtop wine, whereas this occurrence would lead to rejection of H0 if the initially
suggested region were employed. Similarly, the rejection region x̄ ≥ 27.5 might be used in the truck
engine problem in place of the region x̄ ≥ 30. We’ll discuss shortly the tradeoffs between different
rejection region cutoffs and how they are most often determined in practice.


Errors in Hypothesis Testing 

When a jury is called upon to render a verdict in a criminal trial, there are two possible erroneous 
conclusions: convicting an innocent person, or letting a guilty person go free. Similarly, in statistical 
hypothesis testing there are two potential errors whose consequences must be considered when 
reaching a conclusion. 


DEFINITION A type I error consists of rejecting the null hypothesis H0 when it is true.
A type II error involves not rejecting H0 when it is false (i.e., Ha is true).


Since in the U.S. judicial system the null hypothesis (a priori belief) is that the accused is innocent, a 
type I error is analogous to convicting an innocent person, while a false acquittal (i.e., letting a guilty 
person go free) equates to a type II error. 


Example 9.4 (Example 9.3 continued) Before selecting a test procedure and collecting data, the 
truck manufacturer must consider the possible type I and type II errors along with their consequences. 
In this scenario, a type I error means that the manufacturer concludes the modified engine design 
improves fuel efficiency when, in fact, it does not. Thus a type I error would lead the manufacturer to 
perform a very expensive but ultimately useless overhaul of its truck engines. Because this is such a 
consequential error, a test procedure should be selected that makes the chance of a type I error very 
small: if the modified engine design is truly no better than the old one (i.e., H0 is true), this would
ensure a low probability of mistakenly rejecting H0 and proceeding with the change.

Balanced against this possibility is the threat of a type II error: failing to reject H0 when, in fact,
Ha: μ > 25 is correct. That is, in a type II error the manufacturer would fail to recognize that the
modified engine design improves fuel efficiency and would continue to use the old, inferior design.
A type II error is often called an opportunity loss in business: the manufacturer has missed out on the
opportunity to build, sell, and profit from a superior engine design. ■

It would be nice if test procedures could be developed that offered 100% protection against
committing both a type I error and a type II error. This is an impossible goal, though, because our 
conclusion is based on sample data rather than a census of the entire population. There is always some 
chance that random sampling variability will lead to an incorrect conclusion. Instead of demanding 
error-free procedures, we must look for procedures for which both types of error are unlikely to occur. 
That is, a good procedure is one for which the probability of making either type of error is small. The 
choice of a particular rejection region cutoff value fixes the probabilities of type I and type II errors. 
These error probabilities are traditionally denoted by α and β, respectively. Because H0 specifies a
unique value of the parameter, there is a single value of α. However, there is a different value of β for
each value of the parameter consistent with Ha.


Example 9.5 (Example 9.2 continued) A small winemaker will conduct a pilot study by surveying 
n = 25 randomly selected customers about their views on screwtop wine bottles. The parameter of 
interest is now p = the proportion of this winery’s customers who refuse to buy screwtop wine, but 
the hypotheses will remain H0: p = .2 versus Ha: p < .2. Consider the following test procedure:

Test statistic: X = the number of surveyed customers who will not buy screwtop wine bottles
Rejection region: R3 = {0, 1, 2, 3}; that is, reject H0 if x ≤ 3,
where x is the observed value of the test statistic.

This rejection region is called lower-tailed because it consists only of small values of the test statistic.
When H0 is true, X has a binomial probability distribution with n = 25 and p = .2. Then

\[
\begin{aligned}
\alpha &= P(\text{type I error}) = P(H_0 \text{ is rejected when it is true}) \\
&= P(X \le 3 \text{ when } X \sim \text{Bin}(25, .2)) = B(3; 25, .2) = .234
\end{aligned}
\]


That is, if H0 is actually true, 23.4% of all pilot surveys consisting of 25 customers would result in H0
being incorrectly rejected (a type I error). This error probability is quite large; we will consider shortly
how it can be made smaller.

In contrast to α, there is not a single β. Instead, there is a different β for each different p less than .2.
Thus there is a value of β for p = .15 [in which case X ~ Bin(25, .15)], another value of β for p = .1,
and so on. For example,

\[
\begin{aligned}
\beta(.1) &= P(\text{type II error when } p = .1) \\
&= P(H_0 \text{ is not rejected when it is false because } p = .1) \\
&= P(X > 3 \text{ when } X \sim \text{Bin}(25, .1)) = 1 - B(3; 25, .1) = .236
\end{aligned}
\]

When p is actually .1 rather than .2 (a rather large departure from H0), roughly 24% of all surveys of
this type would result in H0 being incorrectly not rejected.

The accompanying table displays β for selected values of p (each calculated for the rejection
region R3). Clearly, β decreases as the value of p moves farther below the null value .2.
Intuitively, the greater the departure from H0, the less likely it is that such a departure will sneak past
undetected.

p       .19    .15    .12    .10    .08    .04
β(p)    .727   .529   .352   .236   .135   .017

Many of these β values are unacceptably large, due in part to the relatively small sample size.

The proposed test procedure is still reasonable for testing the more mathematically correct null
hypothesis that p ≥ .2. In this case, there is no longer a single α, but instead there is an α for
each p that is at least .2: α(.2), α(.25), α(.314), α(.325), and so on. It is easily verified, though,
that α(p) ≤ α(.2) = .234 for all p ≥ .2. That is, the largest value of α occurs for the boundary value
.2 between H0 and Ha. Thus whatever the probability α is for the simplified null hypothesis, it will be
no larger for the more realistic H0. ■
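
The error probabilities of Example 9.5 can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper names `binom_cdf` and `beta` are illustrative choices, not notation from the text. It evaluates the binomial cdf B(x; n, p) directly from its definition for the rejection region R3.

```python
from math import comb

def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n, cutoff = 25, 3                      # pilot sample size; R3 = {0, 1, 2, 3}
alpha = binom_cdf(cutoff, n, 0.2)      # P(reject H0) when p = .2

def beta(p):
    """P(type II error) = P(X > cutoff) when the true proportion is p < .2."""
    return 1 - binom_cdf(cutoff, n, p)

print(f"alpha = {alpha:.3f}")
for p in (0.19, 0.15, 0.12, 0.10, 0.08, 0.04):
    print(f"beta({p:.2f}) = {beta(p):.3f}")
```

Evaluating `binom_cdf(cutoff, n, p)` at values of p above .2 gives the type I error probabilities for the more realistic null hypothesis p ≥ .2, each of which should come out no larger than α(.2).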


Example 9.6 (Example 9.3 continued) Let X1, ..., X10 denote the highway fuel efficiencies (mpg) of
10 randomly selected trucks with the new, modified engine. (If n = 10 seems small, bear in mind that
vehicles used for testing often cannot then be sold to customers.) Under the assumptions of Example
9.3, X1, ..., X10 is a random sample of size 10 from a normal distribution with mean value μ and
standard deviation σ = 3. To test H0: μ = 25 versus Ha: μ > 25, consider the following test
procedure:

Test statistic: X̄ = the sample mean fuel efficiency of the 10 randomly selected trucks

Rejection region: R = [27.5, ∞); that is, reject H0 if x̄ ≥ 27.5,

where x̄ is the observed value of the test statistic

Because the rejection region consists only of large values of the test statistic, the test is said to be
upper-tailed.

The sample mean fuel efficiency X̄ then has a normal distribution with mean μ_X̄ = μ and standard
deviation σ_X̄ = σ/√n = 3/√10 = .95. Calculation of α and β now involves a routine standardization of X̄
followed by reference to the standard normal probabilities of Appendix Table A.3:

\[
\begin{aligned}
\alpha &= P(\text{type I error}) = P(H_0 \text{ is rejected when it is true}) \\
&= P(\bar{X} \ge 27.5 \text{ when } \bar{X} \sim \text{normal with } \mu_{\bar{X}} = 25,\ \sigma_{\bar{X}} = .95) \\
&= 1 - \Phi\!\left(\frac{27.5 - 25}{.95}\right) = 1 - \Phi(2.63) = .0042 \\[4pt]
\beta(26.5) &= P(\text{type II error when } \mu = 26.5) \\
&= P(H_0 \text{ is not rejected when it is false because } \mu = 26.5) \\
&= P(\bar{X} < 27.5 \text{ when } \bar{X} \sim \text{normal with } \mu_{\bar{X}} = 26.5,\ \sigma_{\bar{X}} = .95) \\
&= \Phi\!\left(\frac{27.5 - 26.5}{.95}\right) = \Phi(1.05) = .8531 \\[4pt]
\beta(28) &= \Phi\!\left(\frac{27.5 - 28}{.95}\right) = .2993 \qquad \beta(29) = .0571
\end{aligned}
\]


For the specified test procedure, only 0.4% of all experiments carried out as described will result in H0
being rejected when it is actually true. However, the chance of a type II error is very large when
μ = 26.5 (only a small departure from H0), somewhat less when μ = 28, and quite small when μ = 29
(a rather large departure from H0). These error probabilities are illustrated in Figure 9.1. Notice that α
is computed using the probability distribution of the test statistic when H0 is true, whereas
determination of β requires knowing the test statistic’s distribution when H0 is false.

Figure 9.1 α and β illustrated for Example 9.6: (a) the distribution of X̄ when μ = 25 (H0 true), with shaded area α = .0042;
(b) the distribution of X̄ when μ = 26.5 (H0 false), with shaded area β(26.5); (c) the distribution of X̄ when μ = 28 (H0 false)


As in Example 9.5, if the more realistic null hypothesis μ ≤ 25 is considered, there is an α for
each parameter value for which H0 is true: α(25), α(24.2), α(23.6), and so on. It is easily verified,
though, that α(25) is the largest of all these type I error probabilities. Focusing on the boundary value
amounts to working explicitly with the “worst case.” ■
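
The normal-curve calculations of Example 9.6 can be reproduced without tables. This is a small sketch using `statistics.NormalDist` from the Python standard library; the function name `reject_prob` is an illustrative choice. To mirror the text, the standard deviation of the sample mean is rounded to .95, so the results agree with the printed values up to table rounding.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, cutoff = 3, 10, 27.5
se = round(sigma / sqrt(n), 2)     # sd of the sample mean, rounded to .95 as in the text

def reject_prob(mu):
    """P(sample mean >= cutoff) when the true mean is mu."""
    return 1 - NormalDist(mu, se).cdf(cutoff)

alpha = reject_prob(25)            # type I error probability at the null value
betas = {mu: 1 - reject_prob(mu) for mu in (26.5, 28, 29)}

print(f"alpha = {alpha:.4f}")
for mu, b in betas.items():
    print(f"beta({mu}) = {b:.4f}")
```

Calling `reject_prob` at values below 25 (for example 24.2 or 23.6) gives the type I error probabilities for the null hypothesis μ ≤ 25, which should all be smaller than the value at the boundary.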


Selecting the Rejection Region 

The specification of a cutoff value for the rejection region in the examples just considered was fairly 
arbitrary. Use of the rejection region R3 = {0, 1, 2, 3} in Example 9.5 resulted in α = .234,
β(.10) = .236, and β(.15) = .529. Many would think these error probabilities intolerably large.
Perhaps they can be decreased by changing the cutoff value. 


Example 9.7 (Example 9.5 continued) Let us use the same survey plan and test statistic X as
previously described in the screwtop wine problem but now consider the rejection region R2 = {0, 1,
2}. Since X still has a binomial distribution with parameters n = 25 and p,

\[
\begin{aligned}
\alpha &= P(H_0 \text{ is rejected when } p = .2) \\
&= P(X \le 2 \text{ when } X \sim \text{Bin}(25, .2)) = B(2; 25, .2) = .098
\end{aligned}
\]

The type I error probability has been decreased by using the new rejection region. However, a price
has been paid for this decrease:

\[
\begin{aligned}
\beta(.1) &= P(H_0 \text{ is not rejected when } p = .1) \\
&= P(X > 2 \text{ when } X \sim \text{Bin}(25, .1)) = 1 - B(2; 25, .1) = .463 \\
\beta(.15) &= 1 - B(2; 25, .15) = .746
\end{aligned}
\]


Both these β’s are larger than the corresponding error probabilities .236 and .529 for the region R3. In
retrospect, this is not surprising: α is computed by summing over probabilities of test statistic values
in the rejection region, whereas β is the probability that X falls in the complement of the rejection
region. Making the rejection region smaller must therefore decrease α while increasing β for any fixed
alternative value of the parameter. ■


A similar trade-off between α and β would occur if we changed the rejection region cutoff in
Example 9.6. Looking at Figure 9.1, it’s clear that if we shifted the cutoff c = 27.5 to the left (e.g., to
c = 27), the two β’s illustrated would decrease (less cumulative area under the normal curves) but α
would increase (greater upper-tail area than before). The results of these examples can be generalized
in the following manner.


PROPOSITION Suppose a study and a sample size are fixed and a test statistic is chosen. Then
decreasing the size of the rejection region to obtain a smaller value of α results
in a larger value of β for any particular parameter value consistent with Ha, and
vice versa.
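
The trade-off stated in the proposition can be seen by tabulating α and β over a range of cutoffs for the screwtop wine pilot study. The sketch below is illustrative only: the alternative value p = .1 and the range of cutoffs 0 through 5 are choices made here, and `binom_cdf` is a hypothetical helper name.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n, p0, p_alt = 25, 0.2, 0.1      # sample size, null value, one alternative value
rows = []
for c in range(6):               # rejection region {0, 1, ..., c}
    alpha = binom_cdf(c, n, p0)          # P(X <= c) when H0 is true
    beta = 1 - binom_cdf(c, n, p_alt)    # P(X > c) when p = .1
    rows.append((c, alpha, beta))
    print(f"reject if x <= {c}:  alpha = {alpha:.3f}   beta(.1) = {beta:.3f}")
```

Reading down the printed rows, enlarging the rejection region raises α and lowers β; the rows for c = 2 and c = 3 correspond to the regions R2 and R3 of Examples 9.7 and 9.5.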


This proposition says that once the test statistic and n are fixed, there is no rejection region that will
simultaneously make both α and all β’s small. A region must be chosen to effect a compromise
between α and β. The approach adhered to by most statistical practitioners is to specify the largest
value of α that can be tolerated and find a rejection region having that value of α. This makes β as
small as possible subject to the bound on α. The resulting value of α is often referred to as the
significance level of the test. Traditional levels of significance are .10, .05, and .01, although the level
in any particular problem will depend on the seriousness of a type I error: the more serious this error,
the smaller the significance level should be. The corresponding test procedure is called a level α test
(e.g., a level .05 test or a level .01 test). A test with significance level α is one for which the type I
error probability is controlled at the specified level.


Example 9.8 (Example 9.6 continued) For the truck engine scenario, suppose a hypothesis test with
significance level α = .05 is desired. The rejection region will still have the form x̄ ≥ c, but the value
of c is determined by α:

\[
\begin{aligned}
.05 &= P(\text{type I error}) = P(\bar{X} \ge c \text{ when } H_0 \text{ is true}) \\
&= P(\bar{X} \ge c \text{ when } \bar{X} \sim N(25, .95)) \\
&= 1 - \Phi\!\left(\frac{c - 25}{.95}\right) \;\Rightarrow\; \Phi\!\left(\frac{c - 25}{.95}\right) = .95
\end{aligned}
\]

The last expression above implies that (c − 25)/.95 is the 95th percentile of the standard normal
distribution, z_.05. Either from Section 4.3 or directly from Appendix Table A.3, z_.05 = 1.645, from
which the desired rejection region cutoff is c = 25 + (1.645)(.95) ≈ 26.56 mpg.
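
The cutoff calculation can also be carried out with an inverse normal cdf instead of a table lookup. Here is a brief sketch using `statistics.NormalDist.inv_cdf` from the Python standard library, with the standard deviation of the sample mean taken as .95 as in the text; the variable names and the helper `beta` are illustrative.

```python
from statistics import NormalDist

mu0, se, alpha = 25, 0.95, 0.05
z_alpha = NormalDist().inv_cdf(1 - alpha)   # 95th percentile of the standard normal
c = mu0 + z_alpha * se                      # reject H0 when the sample mean is >= c

def beta(mu):
    """P(type II error) = P(sample mean < c) when the true mean is mu > 25."""
    return NormalDist(mu, se).cdf(c)

print(f"z_alpha = {z_alpha:.3f}, cutoff c = {c:.2f}")
```

Changing `alpha` to .01 or .10 gives the cutoffs for the other traditional significance levels, and `beta(mu)` evaluates the resulting type II error probability at any chosen alternative.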


So, a level .05 test of H0: μ = 25 versus Ha: μ > 25 in this scenario involves rejecting H0 if and
only if x̄ ≥ 26.56. Then β is the probability that X̄ < 26.56 and can be calculated for any μ > 25. ■


Power

Many statistical software packages will calculate type II error probabilities for a variety of test
procedures, including those presented in this and subsequent chapters. This is typically expressed in
terms of power, defined as the probability that the test procedure will reject H0. For parameter values
consistent with Ha, this is simply 1 − β. As the name is meant to imply, greater power is better: lower
values of β (i.e., less chance of a type II error) correspond to higher power values.

As with β, there is not a single value for the power of a test procedure, but rather a different value for
each possible value of the parameter. As a result, though power can be calculated for a single
parameter value, it is more common to see a power curve, where the horizontal axis represents
possible values of the parameter and the vertical axis displays power.


Example 9.9 (Example 9.8 continued) Figure 9.2 shows the power curve for the test procedure that
rejects H0 when x̄ ≥ 26.56. For each value of μ consistent with Ha: μ > 25, the power of the test
procedure is simply P(X̄ ≥ 26.56). Note that the power of the test at μ = 25 is α = .05 by the
definition of power. The power increases as the value of the parameter moves further from the null
value: a large departure from H0 is more likely to be detected than a small departure.


Figure 9.2 Power curve for the test procedure of Examples 9.8-9.9 (vertical axis: power, 0 to 1.0; horizontal axis: μ, 25 to 30) ■
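
Points on a power curve like the one in Figure 9.2 can be computed directly. The sketch below uses only the Python standard library and takes the cutoff 26.56 and the standard deviation .95 of the sample mean from Example 9.8; the function name `power` and the grid of μ values are illustrative choices.

```python
from statistics import NormalDist

se, cutoff = 0.95, 26.56      # sd of the sample mean and the level .05 cutoff

def power(mu):
    """P(sample mean >= cutoff) when the true mean is mu."""
    return 1 - NormalDist(mu, se).cdf(cutoff)

for mu in (25, 26, 27, 28, 29, 30):
    print(f"mu = {mu}: power = {power(mu):.3f}")
```

Evaluating `power` on a finer grid and plotting the results against μ traces out the full curve; at μ = 25 the function returns approximately the significance level.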


One final, but important, note: the probabilities α, β, and power are all functions of the selected test
procedure, not of any sample data. They reflect the chance that certain outcomes of the test procedure
will happen in the future, when a random sample of the specified size n is selected.


Exercises: Section 9.1 (1-14) 


1. For each of the following assertions, state whether it is a legitimate statistical hypothesis and why:
a. H: σ > 100
b. H: x̄ = 45
c. H: s ≤ .20
d. H: σ1/σ2 < 1
e. H: X̄ − Ȳ = 5
f. H: λ ≤ .01, where λ is the parameter of an exponential distribution used to model
component lifetime

2. For the following pairs of assertions, indicate which do not comply with our rules for setting
up hypotheses and why (the subscripts 1 and 2 differentiate between quantities for two
different populations or samples):
a. H0: μ = 100, Ha: μ > 100
b. H0: σ = 20, Ha: σ ≤ 20
c. H0: p ≠ .25, Ha: p = .25
d. H0: μ1 − μ2 = 25, Ha: μ1 − μ2 > 100
e. H0: S1² = S2², Ha: S1² ≠ S2²
f. H0: μ = 120, Ha: μ = 150
g. H0: σ1/σ2 = 1, Ha: σ1/σ2 ≠ 1
h. H0: p1 − p2 = −.1, Ha: p1 − p2 < −.1

3. To determine whether the girder welds in a new performing arts center meet specifications,
a random sample of welds is selected, and tests are conducted on each weld in the
sample. Weld strength is measured as the force required to break the weld. Suppose the
specifications state that mean strength of welds should exceed 100 lb/in²; the inspection
team decides to test H0: μ = 100 versus Ha: μ > 100. Explain why it might be
preferable to use this Ha rather than μ < 100.


4. Let μ denote the true average radioactivity level (picocuries per liter). The value 5 pCi/L
is considered the dividing line between safe and unsafe water. Would you recommend
testing H0: μ = 5 versus Ha: μ > 5 or H0: μ = 5 versus Ha: μ < 5? Explain your
reasoning. [Hint: Think about the consequences of a type I and type II error for each
possibility.]

5. Before agreeing to purchase a large order of polyethylene sheaths for a particular type of
high-pressure oil-filled submarine power cable, a company wants to see conclusive
evidence that the true standard deviation of sheath thickness is less than .05 mm. What
hypotheses should be tested, and why? In this context, what are the type I and type II
errors?


6. Many older homes have electrical systems that use fuses rather than circuit breakers.
A manufacturer of 40-amp fuses wants to make sure that the mean amperage at which
its fuses burn out is in fact 40. If the mean amperage is lower than 40, customers will
complain because the fuses require replacement too often. If the mean amperage is
higher than 40, the manufacturer might be liable for damage to an electrical system due
to fuse malfunction. To verify the amperage of the fuses, a sample of fuses is to be
selected and inspected. If a hypothesis test were to be performed on the resulting data,
what null and alternative hypotheses would be of interest to the manufacturer? Describe
type I and type II errors in the context of this problem situation.

7. Water samples are taken from water used for cooling as it is being discharged from a
power plant into a river. It has been determined that as long as the mean temperature
of the discharged water is at most 150 °F, there will be no negative effects on the river’s
ecosystem. To investigate whether the plant is in compliance with regulations that
prohibit a mean discharge-water temperature above 150°, 50 water samples will be taken at
randomly selected times, and the temperature of each sample recorded. The resulting
data will be used to test the hypotheses H0: μ = 150° versus Ha: μ > 150°. In the
context of this situation, describe type I and type II errors. Which type of error would you
consider more serious? Explain.


8. A regular type of laminate is currently being used by a manufacturer of circuit boards.
A special laminate has been developed to reduce warpage. The regular laminate will be
used on one sample of specimens and the special laminate on another sample, and the
amount of warpage will then be determined for each specimen. The manufacturer will
then switch to the special laminate only if it can be demonstrated that the true average
amount of warpage for that laminate is less than for the regular laminate. State the
relevant hypotheses, and describe the type I and type II errors in the context of this situation.


9. Two different companies have applied to provide internet service in a region. Let
p denote the proportion of all potential subscribers who favor the first company over the
second. Consider testing H0: p = .5 versus Ha: p ≠ .5 based on a random sample of 25
individuals. Let X denote the number in the sample who favor the first company and
x represent the observed value of X.

a. Which of the following rejection regions is most appropriate and why?
R1 = {x: x ≤ 7 or x ≥ 18}, R2 = {x: x ≤ 8}, R3 = {x: x ≥ 17}

b. Using the selected rejection region, what would you conclude if 6 of the 25 queried
favored company 1?

c. In the context of this problem situation, describe what type I and type II errors are.

d. What is the probability distribution of the test statistic X when H0 is true? Use it to
compute the probability of a type I error.

e. For the region selected in part (a), compute the probability of a type II error and
the power when p = .3, .4, .6, and .7.

10. For healthy individuals the level of prothrombin in the blood is approximately
normally distributed with mean 20 mg/dL and standard deviation 4 mg/dL. Low levels
indicate low clotting ability. In studying the effect of gallstones on prothrombin, the level
of each patient in a sample is measured to see if there is a deficiency. Let μ be the true
average level of prothrombin for gallstone patients (and assume σ = 4).

a. What are the appropriate null and alternative hypotheses?

b. Let X̄ denote the sample average level of prothrombin in a sample of n = 20
randomly selected gallstone patients. Consider the test procedure with test statistic
X̄ and rejection region x̄ ≤ 17.92. What is the probability distribution of the test
statistic when H0 is true? What is the probability of a type I error for the test
procedure?

c. What is the probability distribution of the test statistic when μ = 16.7? Using the
test procedure of part (b), what is the probability that gallstone patients will be
judged not deficient in prothrombin, when in fact μ = 16.7 (a type II error)?

d. How would you change the test procedure of part (b) to obtain a test with
significance level .05? What impact would this change have on the error probability of
part (c)?

e. Consider the standardized test statistic Z = (X̄ − 20)/(σ/√n) = (X̄ − 20)/.8944.
What are the values of Z corresponding to the rejection region of part (b)?


11. The calibration of a scale is to be checked by weighing a 10-kg test specimen 25 times.
Suppose that the results of different weighings are independent of one another and that
the weight on each trial is normally distributed with σ = .200 kg. Let μ denote the
true average weight reading on the scale.

a. What hypotheses should be tested?

b. Suppose the scale is to be recalibrated if either x̄ ≥ 10.1032 or x̄ ≤ 9.8968. What
is the probability that recalibration is carried out when it is actually unnecessary?

c. What is the probability that recalibration is judged unnecessary when in fact
μ = 10.1? When μ = 9.8?

d. Let z = (x̄ − 10)/(σ/√n). For what value c is the rejection region of part (b)
equivalent to the “two-tailed” region either z ≥ c or z ≤ −c?

e. If the sample size were only 10 rather than 25, how should the procedure of part
(d) be altered so that α = .05?

f. Using the test of part (e), what would you conclude from the following sample data?

9.981   10.006   9.857    10.107   9.888
9.728   10.439   10.214   10.190   9.793


12. A new design for the braking system on a certain type of car has been proposed. For the
current system, the true average braking distance at 40 mph under specified conditions
is known to be 120 ft. It is proposed that the new design be implemented only if sample
data strongly indicates a reduction in true average braking distance for the new design.

a. Define the parameter of interest and state the relevant hypotheses.

b. Suppose braking distance for the new system is normally distributed with
σ = 10. Let X̄ denote the sample average braking distance for a random sample of
36 observations. Which of the following rejection regions is appropriate:
R1 = {x̄: x̄ ≥ 124.80}, R2 = {x̄: x̄ ≤ 115.20}, R3 = {x̄: either x̄ ≥ 125.13 or x̄ ≤ 114.87}?

c. What is the significance level for the appropriate region of part (b)? How
would you change the region to obtain a test with α = .001?

d. What is the probability that the new design is not implemented when its true
average braking distance is actually 115 ft and the appropriate region from part (b) is
used?

e. Let Z = (X̄ − 120)/(σ/√n). What is the significance level for the rejection
region {z: z ≤ −2.33}? For the region {z: z ≤ −2.88}?

13. Let X1, ..., Xn denote a random sample from a normal population distribution with a
known value of σ.

a. For testing the hypotheses H0: μ = μ0 versus Ha: μ > μ0 (where μ0 is a fixed number),
show that the test with test statistic X̄ and rejection region x̄ ≥ μ0 + 2.33σ/√n has
significance level .01.

b. Let d = μ − μ0, the difference between the true and hypothesized values of the
population mean. Graph the power function of the test procedure in part (a) as a
function of d.

c. Suppose the procedure of part (a) is used to test H0: μ ≤ μ0 versus Ha: μ > μ0. If
μ0 = 100, n = 25, and σ = 5, what is the probability of committing a type I error
when μ = 99? When μ = 98? In general, what can be said about the probability of a
type I error when the actual value of μ is less than μ0? Verify your assertion.

14. Reconsider the situation of Exercise 11 and suppose the rejection region is
{x̄: x̄ ≥ 10.1004 or x̄ ≤ 9.8940} = {z: z ≥ 2.51 or z ≤ −2.65}.

a. What is α for this procedure?

b. What is β when μ = 10.1? When μ = 9.9? Is this desirable?

c. Graph the power function for this test procedure as a function of the unknown μ.


9.2. Tests About a Population Mean 


In Sections 8.1-8.2, confidence intervals for a population mean μ were developed in two stages: first,
for the (unrealistic) scenario when the population standard deviation σ is known, then for cases when
both μ and σ are unknown. We now develop test procedures for these same two cases. Later in this
section, we provide some practical advice on the implementation of hypothesis tests for μ.


Tests About μ for Normal Data with Known σ

Throughout this subsection, we assume that


1. The population distribution is normal.
2. The value of the population standard deviation σ is known.


Although the assumption that the value of σ is known is rarely met in practice, this case provides a
good starting point because of the ease with which general procedures and their properties can be
developed. Let X1, ..., Xn represent a random sample of size n from the normal population. Then the
sample mean X̄ has a normal distribution with expected value μ_X̄ = μ and standard deviation
σ_X̄ = σ/√n. The null hypothesis is H0: μ = μ0, so μ0 is the null value of the parameter. When H0 is
true, μ_X̄ = μ0. Consider now the statistic Z obtained by standardizing X̄ under the assumption that H0
is true:

\[
Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \tag{9.1}
\]


Substitution of the computed sample mean x̄ into (9.1) gives z, the distance between x̄ and μ0
expressed in “standard deviation units.” For example, if the null hypothesis is H0: μ = 100, σ_X̄ = 2,
and x̄ = 103, then the test statistic value is given by z = (103 − 100)/2 = 1.5. That is, the observed
value of x̄ is 1.5 standard deviations (of X̄) above what we expect it to be when H0 is true. The statistic
Z is a natural measure of the distance between X̄, the estimator of μ, and its expected value when H0 is
true. If this distance is too great in a direction consistent with Ha, the null hypothesis should be
rejected.

Suppose first that the alternative hypothesis has the form Ha: μ > μ0. Then an x̄ value less than μ0
certainly does not provide support for Ha. Such an x̄ corresponds to a negative value of z, since x̄ − μ0
is negative and the divisor σ/√n is positive. Similarly, an x̄ value that exceeds μ0 by only a small
amount (corresponding to z which is positive but small) does not suggest that H0 should be rejected in
favor of Ha. The rejection of H0 is appropriate only when x̄ considerably exceeds μ0, that is, when
the z value is positive and large. In summary, the appropriate rejection region has the form z ≥ c for
some relatively large positive constant c.

As discussed in Section 9.1, the cutoff value $c$ should be chosen to control the probability of a type
I error at the desired level $\alpha$. This is easily accomplished because the distribution of the test
statistic $Z$ when $H_0$ is true is the standard normal distribution (that’s why $\mu_0$ was subtracted in
standardizing):

$$\alpha = P(\text{type I error}) = P(H_0 \text{ is rejected when } H_0 \text{ is true})
= P(Z > c \text{ when } Z \sim N(0,1)) = 1 - \Phi(c)$$
$$\Rightarrow \; \Phi(c) = 1 - \alpha \; \Rightarrow \; c = \Phi^{-1}(1 - \alpha) = z_\alpha$$


That is, the rejection region $z > z_\alpha$ has type I error probability $\alpha$. For instance, if a level .01 test
is desired, then $H_0$ should be rejected if $z > c = z_{.01} = 2.33$. This test procedure is upper-tailed
because the rejection region consists only of large values of the test statistic.

Analogous reasoning for the alternative hypothesis $H_a$: $\mu < \mu_0$ suggests a rejection region of the
form $z < c$, where $c$ is a suitably chosen negative number ($\bar{x}$ is far below $\mu_0$ if and only if $z$ is
quite negative). Because $Z$ has a standard normal distribution when $H_0$ is true, taking $c = -z_\alpha$
results in $P(\text{type I error}) = \alpha$. This is a lower-tailed test. For example, $z_{.10} = 1.28$ implies that
the rejection region $z < -1.28$ specifies a test with significance level .10.

Finally, when the alternative hypothesis is $H_a$: $\mu \ne \mu_0$, $H_0$ should be rejected if $\bar{x}$ is too far to
either side of $\mu_0$. This is equivalent to rejecting $H_0$ if either $z > c$ or $z < -c$. Suppose we desire
$\alpha = .05$. Then,

$$.05 = P(Z > c \text{ or } Z < -c \text{ when } Z \sim N(0,1)) = \Phi(-c) + 1 - \Phi(c) = 2[1 - \Phi(c)]$$

Thus $c$ is such that $1 - \Phi(c)$, the area under the standard normal curve to the right of $c$, is .025 (and
not .05!). From Section 4.3 or Appendix Table A.3, $c = 1.96$, and the rejection region is $\{z > 1.96$
or $z < -1.96\}$. For any $\alpha$, the two-tailed rejection region $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ has type I error
probability $\alpha$ (since area $\alpha/2$ is captured under each of the two tails of the $z$ curve). Again, the key
reason for using the standardized test statistic $Z$ is that because $Z$ has a known distribution when $H_0$
is true (standard normal), a rejection region with desired type I error probability is easily obtained by
using an appropriate critical value.
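
The three cutoffs just derived can be checked numerically. The following is a minimal sketch in Python, assuming only the standard library's `statistics.NormalDist`; the function name `z_rejection_region` and the labels "greater", "less", and "two-sided" are hypothetical choices made for this illustration, not notation from the text.

```python
from statistics import NormalDist

def z_rejection_region(alpha, alternative):
    """Return (lower_cutoff, upper_cutoff) for a level-alpha one-sample z test.

    `alternative` is "greater", "less", or "two-sided" (labels chosen for this sketch).
    A cutoff of None means that tail is not part of the rejection region.
    """
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "greater":        # Ha: mu > mu0, reject when z > z_alpha
        return (None, std_normal.inv_cdf(1 - alpha))
    if alternative == "less":           # Ha: mu < mu0, reject when z < -z_alpha
        return (std_normal.inv_cdf(alpha), None)
    if alternative == "two-sided":      # Ha: mu != mu0, area alpha/2 in each tail
        c = std_normal.inv_cdf(1 - alpha / 2)
        return (-c, c)
    raise ValueError(f"unknown alternative: {alternative!r}")

# The three cases discussed above: alpha = .01 upper, .10 lower, .05 two-tailed
for alpha, alt in [(0.01, "greater"), (0.10, "less"), (0.05, "two-sided")]:
    print(alpha, alt, z_rejection_region(alpha, alt))
```

The printed cutoffs should agree with the table values 2.33, 1.28, and 1.96 up to rounding, since `inv_cdf(1 - alpha)` plays the role of $\Phi^{-1}(1 - \alpha) = z_\alpha$.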




The foregoing test procedures are summarized in the accompanying box, and the corresponding 
rejection regions are illustrated in Figure 9.3. 


[Figure 9.3: three z curves (probability distribution of test statistic $Z$ when $H_0$ is true).
(a) Rejection region $z > z_\alpha$; shaded area $= \alpha = P(\text{type I error})$ in the upper tail.
(b) Rejection region $z < -z_\alpha$; shaded area $= \alpha = P(\text{type I error})$ in the lower tail.
(c) Rejection region either $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$; shaded area $= \alpha/2$ in each tail, so that
total shaded area $= \alpha = P(\text{type I error})$.]

Figure 9.3 Rejection regions for z tests: (a) upper-tailed test; (b) lower-tailed test; (c) two-tailed test


THE ONE-SAMPLE z TEST

Null hypothesis: $H_0$: $\mu = \mu_0$

Test statistic value: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

Alternative Hypothesis        Rejection Region for Level $\alpha$ Test

$H_a$: $\mu > \mu_0$          $z > z_\alpha$ (upper-tailed test)
$H_a$: $\mu < \mu_0$          $z < -z_\alpha$ (lower-tailed test)
$H_a$: $\mu \ne \mu_0$        either $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ (two-tailed test)


Use of the following sequence of steps is recommended when testing hypotheses about a parameter; 
these steps will be repeated frequently throughout the remainder of the book. The formulation of 
hypotheses (steps 1 and 2) should be done before examining the data. 


1. Identify the parameter of interest and describe it in the context of the problem situation. 

2. Determine the null value, and state the appropriate null and alternative hypotheses. 

3. Check the plausibility of any assumptions or requirements for the test procedure under
consideration to be valid.

4. Give the formula for the computed value of the test statistic (substituting the null value and the
known values of any other parameters, but not those of any sample-based quantities).

5. State the rejection region for the selected significance level $\alpha$.

6. Compute any necessary sample quantities, substitute into the formula for the test statistic value,
and compute that value.

7. Decide whether $H_0$ should be rejected and state this conclusion in the problem context.

Example 9.10 If the activation temperature of an automated sprinkler system used for fire protection
in an office building is too high, a fire could do substantial damage before water is dispersed. On the
other hand, activation at too low a temperature could cause water damage when there is little fire
threat. A manufacturer of sprinkler systems used for fire protection in office buildings claims that the
true average system-activation temperature is 130 °F. A sample of $n = 9$ systems, when tested, yields a
sample average activation temperature of 131.08 °F. If the distribution of activation temperatures is
normal with standard deviation 1.5 °F, does the data contradict the manufacturer’s claim at
significance level $\alpha = .01$?


1. Parameter of interest: $\mu$ = true average activation temperature

2. Hypotheses: $H_0$: $\mu = 130$ (null value $= \mu_0 = 130$)
   $H_a$: $\mu \ne 130$ (a departure from the claimed value in either direction is of concern)

3. Assumptions/requirements: We have assumed an underlying normal population distribution of
activation temperatures with a known population standard deviation.

4. Test statistic value:

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{\bar{x} - 130}{1.5/\sqrt{n}}$$

5. Rejection region: The form of $H_a$ implies use of a two-tailed test with rejection region either
$z > z_{.005}$ or $z < -z_{.005}$. From Section 4.3 or Appendix Table A.3, $z_{.005} = 2.576$, so we reject
$H_0$ if either $z > 2.576$ or $z < -2.576$.

6. Substituting $n = 9$ and $\bar{x} = 131.08$,

$$z = \frac{131.08 - 130}{1.5/\sqrt{9}} = \frac{1.08}{.5} = 2.16$$

That is, the observed sample mean is a bit more than 2 standard deviations above what would have
been expected were $H_0$ true.

7. The computed value $z = 2.16$ does not fall in the rejection region, so $H_0$ cannot be rejected at
significance level .01. The data does not give sufficient evidence to conclude that the true average
differs from the design value of 130.
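
Steps 4 through 7 of the sprinkler example can be reproduced in a few lines. This is a minimal sketch in Python using only the standard library; the helper name `one_sample_z_test` and its argument names are assumptions made for this illustration, and the function simply mirrors the boxed one-sample z test.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(xbar, mu0, sigma, n, alpha, alternative="two-sided"):
    """Return (z, reject) for the one-sample z test with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))      # test statistic value, as in (9.1)
    std_normal = NormalDist()
    if alternative == "greater":              # upper-tailed: reject when z > z_alpha
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":               # lower-tailed: reject when z < -z_alpha
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":          # two-tailed: reject when |z| > z_{alpha/2}
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, reject

# Example 9.10: n = 9, xbar = 131.08, mu0 = 130, sigma = 1.5, alpha = .01
z, reject = one_sample_z_test(xbar=131.08, mu0=130, sigma=1.5, n=9, alpha=0.01)
print(round(z, 2), reject)
```

Because the critical value is computed rather than read from a table, the same helper can be reused for other significance levels without consulting Appendix Table A.3.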


Power, $\beta$, and Sample Size Determination for the One-Sample z Test

The one-sample z tests are among the few in statistics for which there are simple formulas available
for $\beta$, the probability of a type II error. Consider first the upper-tailed test with rejection region
$z > z_\alpha$. This is equivalent to $\bar{x} > \mu_0 + z_\alpha \cdot \sigma/\sqrt{n}$, so $H_0$ will not be rejected if
$\bar{x} \le \mu_0 + z_\alpha \cdot \sigma/\sqrt{n}$. Now let $\mu'$ denote a particular value of $\mu$ that exceeds the null value $\mu_0$.
Then,

$$\beta(\mu') = P(H_0 \text{ is not rejected when } \mu = \mu')$$
$$= P\left(\bar{X} \le \mu_0 + z_\alpha \cdot \sigma/\sqrt{n} \text{ when } \mu = \mu'\right)$$
$$= P\left(\frac{\bar{X} - \mu'}{\sigma/\sqrt{n}} \le z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} \text{ when } \mu = \mu'\right)$$
$$= \Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$$

The power of the upper-tailed one-sample z test is then $1 - \beta(\mu')$. As $\mu'$ increases, $\mu_0 - \mu'$ becomes
more negative, so $\beta(\mu')$ will be small and power will be large when $\mu'$ greatly exceeds $\mu_0$ (because the
value at which $\Phi$ is evaluated will then be quite negative). Power and $\beta$ for the lower-tailed and
two-tailed tests are derived in an analogous manner.

In addition to specifying the $\alpha$ level, investigators may prescribe a desired power level at an
alternative value $\mu'$ that is of particular concern. In the sprinkler example, company officials might
view $\mu' = 132$ as a very substantial departure from $H_0$: $\mu = 130$ and therefore wish to have a 90%
chance of rejecting $H_0$ (power = .90) at that temperature (that is, $\beta(132) = .10$) in addition to, say,
$\alpha = .01$. More generally, consider the two restrictions $P(\text{type I error}) = \alpha$ and $\beta(\mu') = \beta$ for
specified $\alpha$, $\mu'$, and $\beta$. Then for an upper-tailed test, the sample size $n$ should be chosen to satisfy

$$\Phi\left(z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) = \beta$$

This implies that

$$z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} = \Phi^{-1}(\beta) = -z_\beta$$

It is easy to solve this equation for the desired $n$. A parallel argument yields the necessary sample size
for lower- and two-tailed tests as summarized in the next box.
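
For readers who want the intermediate algebra, the solution for $n$ in the upper-tailed case can be written out as follows. This is a sketch of the standard manipulation; it uses the fact that $\Phi^{-1}(\beta) = -z_\beta$, because $z_\beta$ is the value that captures upper-tail area $\beta$ under the z curve.

```latex
\begin{align*}
z_\alpha + \frac{\mu_0 - \mu'}{\sigma/\sqrt{n}} &= -z_\beta \\
\frac{(\mu' - \mu_0)\sqrt{n}}{\sigma} &= z_\alpha + z_\beta \\
\sqrt{n} &= \frac{\sigma (z_\alpha + z_\beta)}{\mu' - \mu_0} \\
n &= \left[\frac{\sigma (z_\alpha + z_\beta)}{\mu' - \mu_0}\right]^2
   = \left[\frac{\sigma (z_\alpha + z_\beta)}{\mu_0 - \mu'}\right]^2
\end{align*}
```

The last equality holds because squaring removes the sign of the denominator, which is why the same expression serves for both upper- and lower-tailed tests.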




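Alternative Hypothesis        Type II Error Probability $\beta(\mu')$ for a Level $\alpha$ Test

$H_a$: $\mu > \mu_0$          $\Phi\left(z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$

$H_a$: $\mu < \mu_0$          $1 - \Phi\left(-z_\alpha + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$

$H_a$: $\mu \ne \mu_0$        $\Phi\left(z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{\alpha/2} + \dfrac{\mu_0 - \mu'}{\sigma/\sqrt{n}}\right)$

where $\Phi(z)$ = the standard normal cdf. For each case, power $= 1 - \beta(\mu')$.
The sample size $n$ for which a level $\alpha$ test also has $\beta(\mu') = \beta$ at the alternative value $\mu'$ is

$$n = \begin{cases}
\left[\dfrac{\sigma (z_\alpha + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a one-tailed (upper or lower) test} \\[2ex]
\left[\dfrac{\sigma (z_{\alpha/2} + z_\beta)}{\mu_0 - \mu'}\right]^2 & \text{for a two-tailed test (an approximate solution)}
\end{cases}$$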


Example 9.11 Let $\mu$ denote the true average tread life of a type of tire. Consider testing the
hypotheses $H_0$: $\mu = 30{,}000$ versus $H_a$: $\mu > 30{,}000$ based on a sample of size $n = 16$ from a normal
population distribution with $\sigma = 1500$. A test with $\alpha = .01$ requires $z_\alpha = z_{.01} = 2.33$. The
probability of making a type II error when $\mu = 31{,}000$ is

$$\beta(31{,}000) = \Phi\left(2.33 + \frac{30{,}000 - 31{,}000}{1500/\sqrt{16}}\right) = \Phi(-.34) = .3669$$

The probability of rejecting $H_0$ when $\mu = 31{,}000$, i.e., the power, is $1 - .3669 = .6331$.
Since $z_{.1} = 1.28$, the requirement that the level .01 test also have $\beta(31{,}000) = .1$ necessitates

$$n = \left[\frac{1500(2.33 + 1.28)}{30{,}000 - 31{,}000}\right]^2 = (-5.42)^2 = 29.32$$

The sample size must be an integer, so $n = 30$ tires should be used.
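
The two calculations in the tire example can be checked with a short script. The sketch below is in Python with only the standard library; the function names `beta_upper_tailed` and `sample_size_one_tailed` are hypothetical, and the code uses unrounded critical values from `NormalDist`, so its results differ slightly in the later decimals from the hand calculation based on 2.33 and 1.28.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def beta_upper_tailed(mu_alt, mu0, sigma, n, alpha):
    """Type II error probability of the upper-tailed z test at mu = mu_alt."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(z_alpha + (mu0 - mu_alt) / (sigma / sqrt(n)))

def sample_size_one_tailed(mu_alt, mu0, sigma, alpha, beta):
    """Smallest integer n giving a level-alpha one-tailed z test with beta(mu_alt) <= beta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    n_exact = (sigma * (z_alpha + z_beta) / (mu0 - mu_alt)) ** 2
    return ceil(n_exact)  # round up, since n must be an integer

# Example 9.11: H0: mu = 30,000 vs Ha: mu > 30,000, sigma = 1500, n = 16, alpha = .01
beta = beta_upper_tailed(mu_alt=31_000, mu0=30_000, sigma=1500, n=16, alpha=0.01)
n_needed = sample_size_one_tailed(mu_alt=31_000, mu0=30_000, sigma=1500, alpha=0.01, beta=0.10)
print(round(beta, 4), round(1 - beta, 4), n_needed)
```

Rounding up rather than to the nearest integer is deliberate: a smaller $n$ would leave $\beta(\mu')$ above the target.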


The One-Sample t Test 

We now modify the one-sample z test to accommodate the more realistic situation when $\sigma$ is
unknown, following a path similar to what was outlined in Section 8.2. Consider the test statistic
obtained by replacing $\sigma$ in (9.1) by the sample standard deviation, $S$:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \qquad (9.2)$$

Assuming $X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution, the rv (9.2) follows a $t_{n-1}$
distribution when $H_0$: $\mu = \mu_0$ is true. Knowledge of the test statistic’s distribution when $H_0$ is true
(the “null distribution”) allows us to construct a rejection region for which the type I error
probability is controlled at the desired level. For instance, consider testing $H_0$: $\mu = \mu_0$ against
$H_a$: $\mu > \mu_0$ using (9.2). Use of the upper-tail t critical value $t_{\alpha,n-1}$ to specify the rejection region
$t > t_{\alpha,n-1}$ implies that

$$P(\text{type I error}) = P(H_0 \text{ is rejected when it is true})
= P(T > t_{\alpha,n-1} \text{ when } T \text{ has a } t \text{ distribution with } n - 1 \text{ df}) = \alpha$$

The rejection region for the t test differs from that of the z test only in that a t critical value
$t_{\alpha,n-1}$ replaces the z critical value $z_\alpha$. Similar comments apply to alternative hypotheses for which a
lower-tailed or two-tailed test is appropriate.


THE ONE-SAMPLE t TEST

Null hypothesis: $H_0$: $\mu = \mu_0$

Test statistic value: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

Alternative Hypothesis        Rejection Region for a Level $\alpha$ Test

$H_a$: $\mu > \mu_0$          $t > t_{\alpha,n-1}$ (upper-tailed)
$H_a$: $\mu < \mu_0$          $t < -t_{\alpha,n-1}$ (lower-tailed)
$H_a$: $\mu \ne \mu_0$        either $t > t_{\alpha/2,n-1}$ or $t < -t_{\alpha/2,n-1}$ (two-tailed)


Graphs of these rejection regions are essentially the same as those in Figure 9.3; simply replace the
z curves and z critical values with appropriate t curves and t critical values.


Example 9.12 Particulate matter from roads contributes to pollution when those particles are
washed into nearby waterways by rain. The size of the particles can impact the effectiveness of
various stormwater control measures. The authors of the article “Characterizing Runoff from Roads:
Particle Size Distributions, Nutrients, and Gross Solids” (J. Environ. Engr. 2016) took roadside
measurements at several sites in North Carolina. For each assay they recorded $d_{50}$, the median size of
particles in the assay (a standard measure of particle size in such studies). Here are the $d_{50}$ values
(microns) for $n = 9$ assays performed off I-40 near Black Mountain:

82.9 56.8 66.5 49.4 105.4 79.5 82.5 50.7 43.0

Previous studies indicated that the typical $d_{50}$ value alongside roads of this type is 44 microns.
Does the sample data provide convincing statistical evidence that the true mean $d_{50}$ value differs from
44 microns? Let’s carry out a test using a significance level of $\alpha = .01$.

1. $\mu$ = true average $d_{50}$ value (microns) for particulate matter assays at the Black Mountain site
2. $H_0$: $\mu = 44$
   $H_a$: $\mu \ne 44$
3. A normal probability plot (not shown) indicates that the population distribution could plausibly be
normal, so the one-sample t test will be used.
4. $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{\bar{x} - 44}{s/\sqrt{n}}$
5. From Appendix Table A.6, $t_{\alpha/2,n-1} = t_{.005,8} = 3.355$. So we reject $H_0$ if either $t > 3.355$ or
$t < -3.355$.
6. From the data provided, $\bar{x} = 68.52$ and $s = 20.49$. Substituting,

$$t = \frac{68.52 - 44}{20.49/\sqrt{9}} = \frac{24.52}{6.83} = 3.59$$

7. Because the computed value $t = 3.59$ falls in the rejection region ($3.59 > 3.355$), $H_0$ is rejected at
the .01 level. Even this small sample of data provides convincing statistical evidence that the true
mean $d_{50}$ value for roadside particulates at the Black Mountain site differs from the “typical” value
of 44 microns seen in other studies.
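
The summary quantities and test statistic in this example can be recomputed directly from the nine observations. The following is a minimal Python sketch using only the standard library; because the standard library has no t distribution, the critical value 3.355 is simply copied from step 5 (Appendix Table A.6) rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# d50 values (microns) from Example 9.12
d50 = [82.9, 56.8, 66.5, 49.4, 105.4, 79.5, 82.5, 50.7, 43.0]
mu0 = 44          # null value
t_crit = 3.355    # t_{.005,8}, taken from the example, not computed here

n = len(d50)
xbar = mean(d50)              # sample mean
s = stdev(d50)                # sample standard deviation (divisor n - 1)
t = (xbar - mu0) / (s / sqrt(n))
reject = abs(t) > t_crit      # two-tailed rejection region

print(round(xbar, 2), round(s, 2), round(t, 2), reject)
```

Note that `statistics.stdev` uses the $n - 1$ divisor, which matches the sample standard deviation $s$ used throughout this section; `statistics.pstdev` would not.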


Some Practical Advice 
The validity of the one-sample t test rests on Gosset’s Theorem, which assumes a normally distributed
population. The plausibility of this assumption can be checked with a normal probability plot. But as
we noted in Chapter 8, the t distributions are “robust” against violations of normality when the
sample size $n$ is reasonably large. That is, when using data from a large sample (say, $n > 40$), the
results of applying the one-sample t test procedure should be reasonably accurate even if the
underlying population distribution is not normal.

We have also seen that, for $n$ large, the $z$ and $t_{n-1}$ distributions are quite similar, so that using a
z distribution to determine rejection region cutoffs gives very similar results to the one-sample t test
procedure. In current practice, researchers typically use the one-sample t test even for large samples,
except in the extremely rare case where $\sigma$ is known.

The one situation in which inferences for $\mu$ cannot be based on a t procedure is when the sample
size is small and the data strongly suggests a nonnormal population. Methods to address this situation
are considered at the end of this chapter and in Chapter 14.


Example 9.13 A sample of bills for meals was obtained at a restaurant (by Erich Brandt). For each
of 70 bills the tip was found as a percentage of the raw bill (before taxes). Does it appear that the
population mean tip percentage for this restaurant exceeds the standard 15%? Here are the 70 tip
percentages:

14.21  20.24  20.10  14.94  15.69  15.04  12.04  20.16  17.85  16.35
19.12  20.37  15.29  18.39  27.55  16.01  10.94  13.52  17.42  14.48
29.87  17.92  19.74  22.73  14.56  15.16  16.09  16.42  19.07  13.74
13.46  16.79  19.03  19.19  19.23  12.39  16.89  18.93  13.56  17.70
11.48  13.96  21.58  11.94  19.02  17.73  20.07  40.09  19.88  22.79
15.23  16.09  19.19  11.91  18.21  15.37  16.31  16.03  48.77  12.31
21.53  12.76  18.07  14.11  15.86  20.67  15.66  18.54  27.88  13.81

Figure 9.4 shows a descriptive summary obtained from Minitab. The sample mean tip percentage 
is 17.986, which obviously is greater than 15. 


[Figure 9.4: Minitab graphical summary of the tip data, consisting of a histogram of the tip
percentages (axis from 15.0 to 45.0), a boxplot, and interval plots of the 95% confidence intervals
for the mean and median, alongside the following statistics.]

Anderson-Darling Normality Test
  A-Squared            4.17
  P-Value           < 0.005

  Mean               17.986
  StDev               5.937
  Variance           35.247
  Skewness           2.9391
  Kurtosis          12.0154
  N                      70

  Minimum            10.940
  1st Quartile       14.540
  Median             16.840
  3rd Quartile       19.358
  Maximum            48.770

95% Confidence Interval for Mean
  16.571    19.402
95% Confidence Interval for Median
  15.913    18.402
95% Confidence Interval for StDev
   5.090     7.124

Figure 9.4 Minitab descriptive summary for the tip data of Example 9.13


1. $\mu$ = true average tip percentage
2. $H_0$: $\mu = 15$
   $H_a$: $\mu > 15$
3. The distribution is positively skewed because there are some very large tips (and a normal
probability plot therefore would not exhibit a linear pattern). But the large sample size
($n = 70 > 40$) means that the one-sample t test does not require a normal population distribution.
4. $t = \dfrac{\bar{x} - 15}{s/\sqrt{n}}$
5. Using a test with a significance level .05, $H_0$ will be rejected if $t > t_{.05,70-1} \approx 1.667$ (an
upper-tailed test).
6. With $n = 70$, $\bar{x} = 17.986$, and $s = 5.937$,

$$t = \frac{17.986 - 15}{5.937/\sqrt{70}} = \frac{2.986}{.7096} = 4.21$$

7. Since $4.21 > 1.667$, $H_0$ is rejected. There is convincing statistical evidence that the population
mean tip percentage exceeds 15%.
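
When only summary statistics are available, as in the Minitab output for the tip data, the test statistic can be computed from $n$, $\bar{x}$, and $s$ alone. The sketch below is a minimal Python illustration; the function name `t_statistic_from_summary` is hypothetical, and the critical value 1.667 is copied from step 5 rather than computed.

```python
from math import sqrt

def t_statistic_from_summary(xbar, s, n, mu0):
    """Return (standard error, t) for the one-sample t test from summary statistics."""
    se = s / sqrt(n)               # estimated standard error of the sample mean
    return se, (xbar - mu0) / se

# Example 9.13: n = 70, xbar = 17.986, s = 5.937, H0: mu = 15 vs Ha: mu > 15
se, t = t_statistic_from_summary(xbar=17.986, s=5.937, n=70, mu0=15)
t_crit = 1.667                     # approximate t_{.05,69}, taken from the example
reject = t > t_crit                # upper-tailed rejection region

print(round(se, 4), round(t, 2), reject)
```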
Power, $\beta$, and Sample Size Determination for the One-Sample t Test
When the sample size is large (as in Example 9.13), power and sample size calculations for the
one-sample t test can be approximated by the formulas provided earlier in this section. Notice that a
plausible value of $\sigma$ must be specified; the sample standard deviation $s$ may be used for this purpose,
although power and sample size calculations are often performed prior to collecting any data.
Alternatively, $\mu'$ values of interest are sometimes expressed as a certain number of standard
deviations from the null value. For example, researchers may be interested in a one-quarter sd
increase from the null value, in which case $\mu' = \mu_0 + 0.25\sigma$. Re-expressing $\mu'$ in the form $\mu_0 + d\sigma$,
where the value $d$ can be positive or negative, simplifies the expressions for $\beta$ and $n$ presented earlier
in this section so that they no longer depend explicitly on the unknown $\sigma$. For instance, the formula
for $\beta$ in an upper-tailed one-sample z test under this substitution becomes

$$\beta(\mu') = \beta(\mu_0 + d\sigma) = \Phi\left(z_\alpha + \frac{\mu_0 - (\mu_0 + d\sigma)}{\sigma/\sqrt{n}}\right) = \Phi\left(z_\alpha - d\sqrt{n}\right)$$

The other formulas simplify in a similar fashion.
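
The simplified formula is easy to explore numerically. Below is a minimal Python sketch using the standard library; the function name `beta_upper_tailed_d` and the sample sizes in the loop are illustrative assumptions, and the formula is the large-sample normal approximation, not an exact result for the t test.

```python
from math import sqrt
from statistics import NormalDist

def beta_upper_tailed_d(d, n, alpha):
    """Approximate beta(mu0 + d*sigma) = Phi(z_alpha - d*sqrt(n)) for an upper-tailed test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    return std_normal.cdf(z_alpha - d * sqrt(n))

# Type II error probability for a one-quarter sd increase (d = 0.25) at alpha = .05,
# for a few illustrative sample sizes
for n in (16, 64, 256):
    print(n, round(beta_upper_tailed_d(0.25, n, 0.05), 4))
```

Setting $d = 0$ returns $1 - \alpha$, the probability of not rejecting a true null hypothesis, which is a quick sanity check on the formula.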

Exact calculations of power and $\beta(\mu')$ for the one-sample t test (i.e., not using the normal
approximations) are much less straightforward. This is because the test statistic
$T = (\bar{X} - \mu_0)/(S/\sqrt{n})$ in (9.2) does not have a t distribution when $H_0$ is false. Rather, when the
true value of $\mu$ is anything other than $\mu_0$, $T$ has a much more complicated distribution, related to the
following definition.


DEFINITION Let $Z \sim N(0, 1)$ and $Y \sim \chi^2_\nu$ be independent random variables. For any real number
$\delta$, the random variable

$$\frac{Z + \delta}{\sqrt{Y/\nu}} \qquad (9.3)$$

has a noncentral t distribution with $\nu$ degrees of freedom and noncentrality
parameter $\delta$. Note that when $\delta = 0$, the rv (9.3) matches the definition of the t
distribution from Section 6.3 and thus has a $t_\nu$ distribution.


There is no closed-form expression for the noncentral t pdf when $\delta \ne 0$, and so software is essential
for calculations based on it. It can be shown (Exercise 38) that when $\mu = \mu'$, the one-sample t test
statistic (9.2) has a noncentral t distribution with $n - 1$ df and noncentrality parameter

$$\delta = \frac{\mu' - \mu_0}{\sigma/\sqrt{n}} \qquad (9.4)$$

Now consider determining the power of a lower-tailed one-sample t test; the upper-tailed and
two-tailed calculations proceed similarly. Let $F(x; \nu, \delta)$ denote the cdf of the noncentral t distribution.
Then

$$\text{power} = P(T < -t_{\alpha,n-1} \text{ when } \mu = \mu' \text{ rather than } \mu_0)$$
$$= P\left(T < -t_{\alpha,n-1} \text{ when } T \sim \text{noncentral } t,\ \text{df} = n - 1,\ \delta = \frac{\mu' - \mu_0}{\sigma/\sqrt{n}}\right)$$
$$= F\left(-t_{\alpha,n-1};\ n - 1,\ \frac{\mu' - \mu_0}{\sigma/\sqrt{n}}\right)$$

Many statistical software packages can calculate this quantity once all the inputs are specified. As in
the previous discussion, substitutions of the form $\mu' = \mu_0 + d\sigma$ simplify power and $\beta$ expressions and
do not require knowledge (or even an estimate) of $\sigma$.


Example 9.14 The true average voltage drop from collector to emitter of insulated gate bipolar
transistors of a certain type is supposed to be at most 2.5 V. An investigator selects a sample of
$n = 10$ such transistors and uses the resulting voltages as a basis for testing $H_0$: $\mu = 2.5$ versus
$H_a$: $\mu > 2.5$ using a t test with significance level $\alpha = .05$. If the standard deviation of the voltage
distribution is $\sigma = .100$, how likely is it that $H_0$ will be (correctly) rejected when $\mu = 2.6$?

For the values specified, the t critical value is $t_{.05,10-1} = 1.833$ and, using (9.4), the noncentrality
parameter is $\delta = (2.6 - 2.5)/(.100/\sqrt{10}) = 3.162$. For this upper-tailed test,

$$\text{power} = P(T > t_{.05,10-1} \text{ when } \mu = 2.6 \text{ rather than } 2.5)$$
$$= P(T > 1.833 \text{ when } T \sim \text{noncentral } t,\ \text{df} = 9,\ \delta = 3.162)$$
$$= 1 - F(1.833;\ 9,\ 3.162)$$

The R command pt(1.833, df = 9, ncp = 3.162) reveals that $F(1.833; 9, 3.162) = .1025$,
so the power under these circumstances is $1 - .1025 = .8975$. The value .1025 itself is $\beta(2.6)$.
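
The power value can also be sanity-checked by simulation, without any noncentral t software. The sketch below is a Monte Carlo approximation in Python (standard library only), not the exact noncentral t calculation used in the example: it repeatedly draws normal samples with $\mu = 2.6$ and $\sigma = .100$, computes the t statistic (9.2), and records how often it exceeds 1.833. The function name, the number of replications, and the seed are arbitrary choices made for this illustration.

```python
import random
from math import sqrt

def simulate_t_power(mu_true, mu0, sigma, n, t_crit, reps=20000, seed=1):
    """Estimate P(T > t_crit) for the upper-tailed one-sample t test by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu_true, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample sd
        t = (xbar - mu0) / (s / sqrt(n))                          # statistic (9.2)
        if t > t_crit:
            rejections += 1
    return rejections / reps

# Example 9.14 settings: n = 10, mu0 = 2.5, true mu = 2.6, sigma = .100, t_{.05,9} = 1.833
power_hat = simulate_t_power(mu_true=2.6, mu0=2.5, sigma=0.100, n=10, t_crit=1.833)
print(round(power_hat, 3))
```

With 20,000 replications the simulation standard error is roughly .002, so the estimate should land close to, but not exactly on, the value .8975 obtained from R.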

Rather than compute one power value at a time, software can be instructed to create one or more
power curves. Figure 9.5 shows Minitab power curves using the setting of this example for three
different sample sizes: $n = 5$, 10, and 20. The horizontal axis, labeled Difference, represents the
quantity $\mu' - \mu_0$. The previous power value of .8975 corresponds to the height of the $n = 10$ curve in
Figure 9.5 at horizontal value $\mu' - \mu_0 = 2.6 - 2.5 = .1$.


[Figure 9.5: Minitab power curves. Vertical axis: Power, from 0.0 to 1.0. Horizontal axis: Difference,
from 0.00 to 0.20. One curve for each sample size: 5, 10, and 20. Assumptions: $\alpha$ = 0.05,
StDev = 0.1, Alternative >.]

Figure 9.5 Power curves for Example 9.14

Figure 9.5 reveals two intuitive features of the power of this t test. First, for any fixed difference
$\mu' - \mu_0$, power increases with sample size: the $n = 20$ power curve lies above the curves for the
smaller sample sizes. That is, for any fixed departure from $H_0$, a larger sample size will increase the
likelihood of correctly detecting that $H_0$ is false and $H_a$ is true. Second, for any fixed sample size,
power increases as the “Difference” increases, i.e., as the distance between $\mu'$ and $\mu_0$ grows. We are
more likely to reject $H_0$: $\mu = 2.5$ in favor of $H_a$: $\mu > 2.5$ if the true value of $\mu$ is 2.6, say, than if
$\mu = \mu' = 2.51$, since the latter represents a very small departure from $H_0$ and is thus much more
difficult to detect.

Software can also provide the sample size necessary to obtain a certain power or $\beta$ at a specified
alternative value $\mu'$. For example, how large must $n$ be to increase the power at $\mu' = 2.6$ to 95%
(equivalently, reduce the chance of a type II error to $\beta(2.6) = .05$)? Figure 9.6 shows the result of
making the appropriate request to Minitab, from which the answer $n = 13$ is obtained.


Power and Sample Size
1-Sample t Test

Testing mean = null (versus > null)
Calculating power for mean = null + 0.1
Alpha = 0.05  Sigma = 0.1

Sample Size    Target Power    Actual Power
         13          0.9500          0.9597

Figure 9.6 Minitab sample size output for Example 9.14


Exercises: Section 9.2 (15-38) 


15. Let the test statistic $Z$ have a standard normal distribution when $H_0$ is true. Give
    the significance level for each of the following situations:
    a. $H_a$: $\mu > \mu_0$, rejection region $z > 1.88$
    b. $H_a$: $\mu < \mu_0$, rejection region $z < -2.75$
    c. $H_a$: $\mu \ne \mu_0$, rejection region $z > 2.88$ or $z < -2.88$

16. Let the test statistic $T$ have a t distribution when $H_0$ is true. Give the significance level
    for each of the following situations:
    a. $H_a$: $\mu > \mu_0$, df = 15, rejection region $t > 3.733$
    b. $H_a$: $\mu < \mu_0$, $n = 24$, rejection region $t < -2.500$
    c. $H_a$: $\mu \ne \mu_0$, $n = 31$, rejection region $t > 1.697$ or $t < -1.697$

17. The true average diameter of ball bearings of a certain type is supposed to be .5 in.
    A one-sample t test will be carried out to see whether this is the case. What
    conclusion is appropriate in each of the following situations?
    a. $n = 13$, $t = 1.6$, $\alpha = .05$
    b. $n = 13$, $t = -1.6$, $\alpha = .05$
    c. $n = 25$, $t = -2.6$, $\alpha = .01$
    d. $n = 25$, $t = -3.9$

18. The drying time (min) of a particular paint on a test board under controlled conditions
    is known to be normally distributed with $\mu = 75$ and $\sigma = 9$. A new additive has been
    developed for the purpose of improving drying time. The hypotheses $H_0$: $\mu = 75$
    versus $H_a$: $\mu < 75$ are to be tested using a random sample of $n = 25$ observations.
    Assume drying times are still normally distributed with $\sigma = 9$.
    a. How many standard deviations (of $\bar{X}$) below the null value is $\bar{x} = 72.3$?
    b. If $\bar{x} = 72.3$, what is the conclusion using $\alpha = .01$?
    c. What is $\alpha$ for the test procedure that rejects $H_0$ when $z < -2.88$?
    d. For the test procedure of part (c), what is $\beta(70)$?
    e. If the test procedure of part (c) is used, what $n$ is necessary to ensure that
       $\beta(70) = .01$?
    f. If a level .01 test is used with $n = 100$, what is the probability of a type II error
       when $\mu = 76$?

19. The melting point of each of 16 samples of a brand of hydrogenated vegetable oil was
    determined, resulting in $\bar{x} = 94.32$. Assume that the distribution of melting point is
    normal with $\sigma = 1.20$.
    a. Test $H_0$: $\mu = 95$ versus $H_a$: $\mu \ne 95$ using a two-tailed level .01 test.
    b. If a level .01 test is used, what is $\beta(94)$, the probability of a type II error when $\mu = 94$?
    c. What value of $n$ is necessary to ensure that $\beta(94) = .1$ when $\alpha = .01$?

20. Answer the following questions for the tire problem in Example 9.11.
    a. If $\bar{x} = 30{,}960$ and a level $\alpha = .01$ test is used, what is the decision?
    b. If a level .01 test is used, what is $\beta(30{,}500)$? What is the power at
       $\mu = 30{,}500$ miles?
    c. If a level .01 test is used and it is also required that $\beta(30{,}500) = .05$, what
       sample size $n$ is necessary?
    d. If $\bar{x} = 30{,}960$, what is the smallest $\alpha$ at which $H_0$ can be rejected (based on
       $n = 16$)?

21. Lightbulbs of a certain type are advertised as having an average lifetime of 750 h. The
    price of these bulbs is very favorable, so a potential customer has decided to go ahead
    with a purchase arrangement unless it can be conclusively demonstrated that the true
    average lifetime is smaller than what is advertised. A random sample of 50 bulbs
    was selected, the lifetime of each bulb determined, and the appropriate hypotheses
    were tested, resulting in the accompanying output.

    Variable    N    Mean     StDev   SEMean   Z       P-Value
    Lifetime    50   738.44   38.20   5.40     -2.14   0.016


    What conclusion would be appropriate for a significance level of .05? A significance
    level of .01? What significance level and conclusion would you recommend?

22. The industry standard for the amount of alcohol poured into many types of drinks
    (e.g., gin for a gin and tonic, whiskey on the rocks) is 1.5 oz. Each individual in a
    sample of 8 bartenders with at least 5 years of experience was asked to pour rum for a
    rum and coke into a short, wide (tumbler) glass, resulting in the following data:

    2.00 1.78 2.16 1.91 1.70 1.67 1.83 1.48

    (Summary quantities agree with those given in the article “Bottoms Up! The Influence
    of Elongation on Pouring and Consumption Volume,” J. Consumer Res. 2003: 455-463.)
    a. What does a boxplot suggest about the distribution of the amount poured?
    b. Carry out a test of hypotheses to decide whether there is strong evidence for
       concluding that the true average amount poured differs from the industry standard.
    c. Does the validity of the test you carried out in (b) depend on any assumptions about
       the population distribution? If so, check the plausibility of such assumptions.
    d. Suppose the actual standard deviation of the amount poured is .20 oz. Determine
       the probability of a type II error for the test of (b) when the true average amount
       poured is actually (1) 1.6, (2) 1.7, (3) 1.8.

23. Exercise 46 in Chapter 1 gave $n = 26$ observations on escape time (sec) for oil
    workers in a simulated exercise, from which the sample mean and sample standard
    deviation are 370.69 and 24.36, respectively. Suppose the investigators had
    believed a priori that true average escape time would be at most 6 min. Does the data
    contradict this prior belief? Assuming normality, test the appropriate hypotheses
using a significance level of .05. 
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24. Although the U.S. Food and Drug Admin- 


25. 


26. 


istration recommends against using kitchen 
utensils to dose liquid medicines, many 
people still do so, resulting in dosing errors 
and even pediatric poisonings. The letter 
“Spoons Systematically Bias Dosing of 
Liquid Medicine” (Annals of Internal Med. 
2010: 66-67) reported on an experiment 
involving a sample of 195 individuals. 
Each individual was asked to pour exactly 
5 mL of a liquid medication into a medium- 
sized tablespoon whose capacity was 
15 mL. The sample mean amount poured 
was 4.58 mL and the sample standard 
deviation was 2.55 mL. Does this data 
indicate that the true average amount 
poured is different from the desired dose? 
Test at the .05 level. 


Consider the following core wood density 
measurements (g/mm*) from a sample of 25 
canopy trees in western Thailand (“Radial 
Variation of Wood Functional Traits 
Reflect Size-Related Adaptations of Tree 
Mechanics and Hydraulics,” Functional 
Ecology 2017: 260-272) 


391.2 
543.7 
492.3 
647.8 
Wa:2 


431.0 
592.7 
454.4 
639.2 
668.7 


447.1 
546.7 
548.7 
700.4 
644.6 


315.3 
601.8 
494.9 
640.1 
717.7 


470.7 
598.8 
585.6 
620.5 
663.0 


a. Perform a hypothesis test at the .05 level 
to determine if the true mean core wood 
density differs from 600 g/mm’. 

b. This data appeared in Example 8.11, 
where a 95% CI for 4 was computed to 
be (528.0, 613.8). Explain how the 
results of your hypothesis test in part 
(a) are consistent with this confidence 
interval. 


The article “Development of Novel Indus- 
trial Laminated Planks from Sweetgum 
Lumber” (J. Bridge Engr. 2008: 64-66) 
provides the following data on the modulus 
of rupture (psi) for a sample of planks: 


6807.99 
6981.46 
6906.04 
7295.54 
7422.69 


27. 


28. 


29. 
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7637.06 
7569.75 
6617.17 
6702.76 
7886.87 


6663.28 
7437.88 
6984.12 
7440.17 
6316.67 


6165.03 
6872.39 
7093.71 
8053.26 
7713.65 


6991.41 
7663.18 
7659.50 
8284.75 
7503.33 


6992.23 
6032.28 
7378.61 
7347.95 
7674.99 


a. Perform a hypothesis test at the .01 level to 
determine if the true modulus of rupture for 
this type of plank differs from 7500 psi. 


b. A 99% confidence interval for p is 


(6929.7, 7476.7); this was calculated 
using the one-sample ¢ interval of 
Chapter 8. Explain how the results of your 
hypothesis test in part (a) are consistent 
with this confidence interval. 

On the label, Pepperidge Farm bagels are said 

to weigh four ounces each (113 g). A random 

sample of six bagels resulted in the following 
weights (in grams): 

117.6 1095 111.6 109.2 119.1 110.8 

a. Based on this sample, is there any reason 
to doubt that the population mean is at 
least 113 g? 

b. Suppose that the population mean is 
actually 110 g and that the distribution is 
normal with standard deviation 4 g. Based 
on a z test of Ay: w= 113 against 
Hi: uw < 113 with a = .05, find the proba- 
bility of rejecting Hp with six observations. 

c. Under the conditions of part (b) with 
a = .05, how many more observations 
would be needed in order for the power to 
be at least .95? 


The target thickness for silicon wafers used in 
a type of integrated circuit is 245 um. 
A sample of 50 wafers is obtained and the 
thickness of each one is determined, resulting 
in a sample mean thickness of 246.18 um 
and a sample standard deviation of 3.60 um. 
Does this data suggest that true average wafer 
thickness is something other than the target 
value? Test at the .10 level. 

A well-designed and safe workplace can 
contribute greatly to increased productivity. 
It is especially important that workers not be 


30. 
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asked to perform tasks, such as lifting, that 
exceed their capabilities. The accompanying 
data on maximum weight of lift (MAWL, in 
kg) for a frequency of four lifts/min was 
reported in the article “The Effects of Speed, 
Frequency, and Load on Measured Hand 
Forces for a Floor-to-Knuckle Lifting Task” 
(Ergonomics 1992: 833-843); subjects were 
randomly selected from the population of 
healthy males age 18-30. Assuming that 
MAWL is normally distributed, does the 
following data suggest that the population 
mean MAWL exceeds 25? Test using a sig- 
nificance level of .05. 


25.8 36.6 26.3 21.8 27.2 


30. The article “The Foreman’s View of Quality
Control” (Quality Engr 1990: 257-280)
described an investigation into the coating
weights for large pipes resulting from a galvanized
coating process. Production standards
call for a true average weight of 200 lb per
pipe. The accompanying descriptive summary
and boxplot are from Minitab.


Variable    N    Mean   Median  TrMean  StDev  SE Mean
ctg wt     30  206.73   206.00  206.81   6.35     1.16

Variable     Min     Max      Q1      Q3
ctg wt    193.00  218.00  202.75  212.00


[Boxplot of coating weight; horizontal axis from 190 to 220]


a. What does the boxplot suggest about the 
status of the specification for true average 
coating weight? 

b. A normal probability plot of the data was 
quite straight. Use the descriptive output 
to test the appropriate hypotheses. 


The amount of shaft wear (.0001 in.) after a
fixed mileage was determined for each of
n = 8 internal combustion engines having
copper lead as a bearing material, resulting in
$\bar{x} = 3.72$ and $s = 1.25$.


a. Assuming that the distribution of shaft wear
is normal with mean $\mu$, use the t test at level
.05 to test $H_0$: $\mu = 3.50$ versus $H_a$: $\mu > 3.50$.

b. Using $\sigma = 1.25$, what is the type II error
probability $\beta(\mu')$ of the test for the alternative
$\mu' = 4.00$?

32. The recommended daily dietary allowance
for zinc among males older than age 50 years
is 15 mg/day. The article “Nutrient Intakes
and Dietary Patterns of Older Americans: A
National Study” (J. Gerontol. 1992: M145-150)
reports the following summary data on
intake for a sample of males age 65-74 years:
n = 115, $\bar{x} = 11.3$, and $s = 6.43$.
Does this data indicate that average daily zinc
intake in the population of all males age 65-74
falls below the recommended allowance?


33. In an experiment designed to measure the
time necessary for an inspector’s eyes to
become used to the reduced amount of light
necessary for penetrant inspection, the sample
average time for n = 9 inspectors was
6.32 s and the sample standard deviation was
1.65 s. It has previously been assumed that
the average adaptation time was at least 7 s.
Assuming adaptation time to be normally
distributed, does the data contradict prior
belief? Use the t test with $\alpha = .1$.


34. A sample of 12 radon detectors of a certain
type was selected, and each was exposed to
100 pCi/L of radon. The resulting readings
were as follows:

105.6   90.9   91.2   96.9   96.5   91.3
100.1  105.0   99.6  107.7  103.3   92.4


a. Does this data suggest that the population
mean reading under these conditions differs
from 100? State and test the appropriate
hypotheses using $\alpha = .05$.

b. Suppose that prior to the experiment, a
value of $\sigma = 7.5$ had been assumed. How
many determinations would then have
been appropriate to obtain $\beta = .10$ for the
alternative $\mu' = 95$? [Note: Software
required.]

35. Show that for any $\Delta > 0$, when the population distribution is normal and $\sigma$ is known,
the two-tailed test satisfies $\beta(\mu_0 - \Delta) = \beta(\mu_0 + \Delta)$, so that $\beta(\mu')$ is symmetric about $\mu_0$.

36. For a fixed alternative value $\mu'$, show that $\beta(\mu') \to 0$ as $n \to \infty$ for either a one-tailed
or a two-tailed z test in the case of a normal population distribution with known $\sigma$.

37. Let $F(x; \nu, \delta)$ denote the cdf of the noncentral t distribution.

a. Determine the power function of an upper-tailed one-sample t test in terms of F.
[Hint: Imitate the steps shown for the lower-tailed case in this section.]

b. Repeat part (a) for the two-tailed one-sample t test.

38. Show that when $\mu = \mu'$, the one-sample t statistic (9.2) has a noncentral t distribution
with n − 1 df and noncentrality parameter $\delta$ given by (9.4). [Hint: $(\bar{X} - \mu')/(\sigma/\sqrt{n})$ has a
standard normal distribution. Re-write (9.2) and follow the steps in Section 6.4 that showed
why $(\bar{X} - \mu)/(S/\sqrt{n})$ has a $t_{n-1}$ distribution.]


9.3 Tests About a Population Proportion 


Let p denote the proportion of individuals or objects in a population who possess a specified property
(e.g., students who graduate college debt-free or former smokers who now vape). If an individual or
object with the property is labeled a success (S), then p is the population proportion of successes.
Tests concerning p will be based on a random sample of size n from the population. Provided that n is
small relative to the population size, the rv X = the number of S’s in the sample has at least
approximately a binomial distribution. Furthermore, if n itself is large, both X and the estimator
$\hat{P} = X/n$ are approximately normally distributed. We first consider large-sample tests based on this
latter fact and then turn to the small-sample case that directly uses the binomial distribution.


Large-Sample Tests 
The estimator $\hat{P} = X/n$ is unbiased [$E(\hat{P}) = p$], has approximately a normal distribution, and its
standard deviation is $\sigma_{\hat{P}} = \sqrt{p(1-p)/n}$. These facts were used in Section 8.3 to obtain a confidence
interval for p. When $H_0$: $p = p_0$ is true, $E(\hat{P}) = p_0$ and $\sigma_{\hat{P}} = \sqrt{p_0(1-p_0)/n}$. It then follows that when
n is large and $H_0$ is true, the test statistic

$$Z = \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad (9.5)$$

has approximately a standard normal distribution.

Test procedures based on (9.5) can then be developed in a fashion similar to those of the first half
of Section 9.2. For instance, if the alternative hypothesis is $H_a$: $p > p_0$ and the upper-tailed rejection
region $z \ge z_\alpha$ is used, then

$$
\begin{aligned}
P(\text{type I error}) &= P(H_0 \text{ is rejected when it is true}) \\
&= P(Z \ge z_\alpha \text{ when } Z \text{ has approximately a standard normal distribution}) \approx \alpha
\end{aligned}
$$

Thus the desired level of significance $\alpha$ is attained by using the critical value that captures area $\alpha$ in the
upper tail of the z curve. Rejection regions for the other two alternative hypotheses, lower-tailed for
$H_a$: $p < p_0$ and two-tailed for $H_a$: $p \ne p_0$, are justified in an analogous manner.


THE ONE-PROPORTION z TEST

Null hypothesis: $H_0$: $p = p_0$

Test statistic value: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

Alternative Hypothesis      Rejection Region for Level $\alpha$ Test
$H_a$: $p > p_0$            $z \ge z_\alpha$ (upper-tailed)
$H_a$: $p < p_0$            $z \le -z_\alpha$ (lower-tailed)
$H_a$: $p \ne p_0$          either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed)

These test procedures are valid provided that both $np_0 \ge 10$ and
$n(1 - p_0) \ge 10$.
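
Because X is discrete, the level attained by the large-sample test is only approximately $\alpha$. The following Python sketch (not part of the original text) computes the exact type I error probability of the upper-tailed test by summing binomial probabilities over the rejection region. The function name and the values n = 20, 200, 2000 with $p_0 = .3$ are illustrative assumptions.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bin(n, p), computed on the log scale to avoid overflow."""
    log_coef = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_coef + x * log(p) + (n - x) * log(1 - p))


def exact_type1_error_upper(n: int, p0: float, alpha: float) -> float:
    """Exact P(reject H0 | p = p0) for the upper-tailed large-sample z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value z_alpha
    se0 = sqrt(p0 * (1 - p0) / n)              # standard deviation of P-hat under H0
    total = 0.0
    for x in range(n + 1):
        z = (x / n - p0) / se0
        if z >= z_alpha:                       # x lies in the rejection region
            total += binom_pmf(x, n, p0)
    return total


if __name__ == "__main__":
    # Hypothetical settings; note that n = 20 violates n * p0 >= 10.
    for n in (20, 200, 2000):
        print(n, round(exact_type1_error_upper(n, 0.3, 0.05), 4))
```

Comparing the printed values with the nominal .05 gives a rough sense of how well the normal approximation holds at each sample size.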


Example 9.15 Obesity is an increasing problem in America among all age groups. The article 
“Factors Affecting Obesity and Waist Circumference Among U.S. Adults” (Prevention of Chronic 
Diseases 2019) reported that 686 individuals in a sample of 2014 adult men were found to be obese 
(a body mass index exceeding 30; this index is a measure of weight relative to height). An earlier 
survey based on people’s own assessment revealed that 20% of adult Americans considered them- 
selves obese. Does the recent data suggest that the true proportion of men who are obese is more than 
1.5 times the percentage from the self-assessment survey? Let’s carry out a test of hypotheses using a 
significance level of .10. 


1. p = the proportion of all American men who are obese.

2. Saying that the current percentage is 1.5 times the self-assessment percentage is equivalent to the
assertion that the current percentage is 30%, from which we have the null hypothesis $H_0$: $p = .30$.
The phrase “more than” in the question implies that the alternative hypothesis is $H_a$: $p > .30$.

3. Since $np_0 = 2014(.3) \ge 10$ and $nq_0 = 2014(.7) \ge 10$, the large-sample z test can certainly be used.

4. The test statistic value is

$$z = \frac{\hat{p} - .3}{\sqrt{(.3)(.7)/n}}$$

5. The form of $H_a$ implies that an upper-tailed test is appropriate: Reject $H_0$ if $z \ge z_{.10} = 1.28$.

6. $\hat{p} = 686/2014 = .3406$, from which $z = (.3406 - .3)/\sqrt{(.3)(.7)/2014} = 3.98$.

7. Since 3.98 exceeds the critical value 1.28, z lies in the rejection region. This justifies rejecting the
null hypothesis. Using a significance level of .10, it does appear that more than 30% of American
adult men are obese. ■
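
The steps of this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical helpers rather than anything from the text, and `NormalDist().inv_cdf` stands in for the z table.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x: int, n: int, p0: float) -> float:
    """Test statistic z = (p_hat - p0) / sqrt(p0 (1 - p0) / n)."""
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("large-sample conditions n*p0 >= 10 and n*(1-p0) >= 10 not met")
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


def reject_upper_tailed(z: float, alpha: float) -> bool:
    """Upper-tailed rejection region: reject H0 when z >= z_alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z >= z_alpha


# Data from Example 9.15: 686 obese men in a sample of 2014, H0: p = .30.
z = one_proportion_z(686, 2014, 0.30)
print(f"z = {z:.2f}; reject H0 at level .10: {reject_upper_tailed(z, 0.10)}")
```

The check on $np_0$ and $nq_0$ mirrors step 3 of the example, so the function refuses to compute z when the large-sample approximation is not justified.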


Power, f, and Sample Size Determination for the One-Proportion z Test 

When $H_0$ is true, the test statistic Z has approximately a standard normal distribution. Now suppose
that $H_0$ is not true and that $p = p'$. Then Z still has approximately a normal distribution (because it is a
linear function of $\hat{P}$), but its mean value and variance are no longer 0 and 1, respectively. Instead,

$$E(Z) = \frac{p' - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad V(Z) = \frac{p'(1-p')/n}{p_0(1-p_0)/n}$$


The power for an upper-tailed test is $P(Z \ge z_\alpha$ when $p = p')$, whereas the chance of a type II error is
$\beta(p') = P(Z < z_\alpha$ when $p = p')$. These can be computed by using the given mean and variance to
standardize and then referring to the standard normal cdf. In addition, if it is desired that the level $\alpha$
test also have $\beta(p') = \beta$ for a specified value of $\beta$, this equation can be solved for the necessary n as in
Section 9.2. General expressions for $\beta(p')$ and n are given in the accompanying box.


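Alternative Hypothesis      $\beta(p')$

$H_a$: $p > p_0$            $\Phi\left(\dfrac{p_0 - p' + z_\alpha\sqrt{p_0 q_0/n}}{\sqrt{p' q'/n}}\right)$

$H_a$: $p < p_0$            $1 - \Phi\left(\dfrac{p_0 - p' - z_\alpha\sqrt{p_0 q_0/n}}{\sqrt{p' q'/n}}\right)$

$H_a$: $p \ne p_0$          $\Phi\left(\dfrac{p_0 - p' + z_{\alpha/2}\sqrt{p_0 q_0/n}}{\sqrt{p' q'/n}}\right) - \Phi\left(\dfrac{p_0 - p' - z_{\alpha/2}\sqrt{p_0 q_0/n}}{\sqrt{p' q'/n}}\right)$


where $q_0 = 1 - p_0$, $q' = 1 - p'$, and $\Phi(z)$ = the standard normal cdf. For each case, power = $1 - \beta(p')$.
The sample size n for which the level $\alpha$ test also satisfies $\beta(p') = \beta$ is

$$
n = \begin{cases}
\left[\dfrac{z_\alpha\sqrt{p_0 q_0} + z_\beta\sqrt{p' q'}}{p' - p_0}\right]^2 & \text{one-tailed test} \\[2ex]
\left[\dfrac{z_{\alpha/2}\sqrt{p_0 q_0} + z_\beta\sqrt{p' q'}}{p' - p_0}\right]^2 & \text{two-tailed test (an approximate solution)}
\end{cases}
$$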
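
As a sketch of where the upper-tailed entry comes from (relying only on the normal approximation for $\hat{P}$ described above), rewrite the event $Z < z_\alpha$ in terms of $\hat{P}$ and standardize using the mean $p'$ and standard deviation $\sqrt{p'q'/n}$ that apply when $p = p'$:

$$
\begin{aligned}
\beta(p') &= P(Z < z_\alpha \text{ when } p = p') \\
&= P\left(\hat{P} < p_0 + z_\alpha\sqrt{p_0 q_0/n} \text{ when } p = p'\right) \\
&\approx \Phi\left(\frac{p_0 - p' + z_\alpha\sqrt{p_0 q_0/n}}{\sqrt{p' q'/n}}\right)
\end{aligned}
$$

Setting this equal to $\beta$ and using $\Phi(-z_\beta) = \beta$ gives $p_0 - p' + z_\alpha\sqrt{p_0 q_0/n} = -z_\beta\sqrt{p' q'/n}$. Multiplying through by $\sqrt{n}$ and solving for n yields the one-tailed sample size expression. The lower-tailed case follows by the same steps with the inequality reversed.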


Example 9.16 A package-delivery service advertises that at least 90% of all packages brought to its
office by 9 a.m. for delivery in the same city are delivered by noon that day. Let p denote the true
proportion of such packages that are delivered as advertised and consider the hypotheses $H_0$: $p = .9$ versus
$H_a$: $p < .9$. If only 80% of all packages are delivered as advertised, how likely is it that a level .01 test based
on n = 225 packages will detect such a departure from $H_0$? With $\alpha = .01$, $p_0 = .9$, $p' = .8$, and n = 225,

$$\beta(.8) = 1 - \Phi\left(\frac{.9 - .8 - 2.33\sqrt{(.9)(.1)/225}}{\sqrt{(.8)(.2)/225}}\right) = 1 - \Phi(2.00) = .0228$$

Thus the probability that $H_0$ will be rejected using the test when p = .8 (the power of the test
procedure) is 1 − .0228 = .9772. Roughly 98% of all samples of size 225 will result in correct
rejection of $H_0$.

What should the sample size be to ensure 99% power when p is actually .8? The 99% power
requirement is equivalent to $\beta(.8) = .01$. Using $z_\alpha = z_\beta = z_{.01} = 2.33$ in the sample size formula yields

$$n = \left[\frac{2.33\sqrt{(.9)(.1)} + 2.33\sqrt{(.8)(.2)}}{.8 - .9}\right]^2 \approx 266$$
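
The two calculations in this example can be checked numerically. The sketch below assumes the lower-tailed $\beta(p')$ expression and the one-tailed sample size formula from the box; the function names are illustrative. Because it uses the exact normal quantile (about 2.326) rather than the rounded table value 2.33, the computed $\beta$ differs slightly from .0228 in the third decimal place.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def beta_lower_tailed(p0: float, p_alt: float, n: int, alpha: float) -> float:
    """Type II error probability of the lower-tailed level-alpha z test at p = p_alt."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    numerator = p0 - p_alt - z_alpha * sqrt(p0 * (1 - p0) / n)
    return 1 - STD_NORMAL.cdf(numerator / sqrt(p_alt * (1 - p_alt) / n))


def sample_size_one_tailed(p0: float, p_alt: float, alpha: float, beta: float) -> int:
    """Smallest integer n meeting the one-tailed sample size formula."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    root_n = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p_alt * (1 - p_alt))) / (p_alt - p0)
    return ceil(root_n ** 2)


if __name__ == "__main__":
    b = beta_lower_tailed(0.9, 0.8, 225, 0.01)
    print(f"beta(.8) = {b:.4f}, power = {1 - b:.4f}")
    print("n for 99% power:", sample_size_one_tailed(0.9, 0.8, 0.01, 0.01))
```

Rounding n up with `ceil` is a design choice: any fractional sample size must be rounded up for the power requirement to be met.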


Small-Sample Tests 

Test procedures when the sample size n is small are based directly on the binomial distribution rather
than the normal approximation. Consider the alternative hypothesis $H_a$: $p > p_0$ and again let X be the
number of successes in the sample. Then X is the test statistic, and the upper-tailed rejection region
has the form $x \ge c$. When $H_0$ is true, X has a binomial distribution with parameters n and $p_0$, so

$$
\begin{aligned}
P(\text{type I error}) &= P(H_0 \text{ is rejected when it is true}) \\
&= P(X \ge c \text{ when } X \sim \text{Bin}(n, p_0)) \\
&= 1 - P(X \le c - 1 \text{ when } X \sim \text{Bin}(n, p_0)) \\
&= 1 - B(c - 1; n, p_0)
\end{aligned}
$$


As the critical value c decreases, more x values are included in the rejection region and P(type I error)
increases. Because X has a discrete probability distribution, it is usually not possible to find a value of
c for which P(type I error) is exactly the desired significance level $\alpha$ (e.g., .05 or .01). Instead, the
largest rejection region of the form {c, c + 1, ..., n} satisfying $1 - B(c - 1; n, p_0) \le \alpha$ is used.

Let $p'$ denote a value of p consistent with the alternative hypothesis (so $p' > p_0$). When $p = p'$,
$X \sim \text{Bin}(n, p')$, so

$$
\begin{aligned}
\beta(p') &= P(\text{type II error when } p = p') = P(X < c \text{ when } X \sim \text{Bin}(n, p')) \\
&= B(c - 1; n, p')
\end{aligned}
$$

and power = $1 - \beta(p')$. Both of these are straightforward binomial probability calculations. On the
other hand, the sample size n necessary to ensure that a level $\alpha$ test also has specified $\beta$ at a particular
alternative value $p'$ must be determined by trial and error using the binomial cdf.

Test procedures for $H_a$: $p < p_0$ and for $H_a$: $p \ne p_0$ are constructed in a similar manner. In the
former case, the appropriate rejection region has the form $x \le c$ (a lower-tailed test). The critical
value c is the largest number satisfying $B(c; n, p_0) \le \alpha$. The rejection region when the alternative
hypothesis is $H_a$: $p \ne p_0$ consists of both large and small x values.
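
The binomial calculations just described can be sketched in Python for the lower-tailed case. The function names are illustrative, and the inputs n = 20, $p_0 = .9$, $\alpha = .05$, and $p' = .8$ are sample values chosen for the demonstration; the cdf is computed directly rather than read from Appendix Table A.1.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """B(k; n, p) = P(X <= k) for X ~ Bin(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(min(k, n) + 1))


def lower_tailed_critical_value(n: int, p0: float, alpha: float):
    """Largest c with B(c; n, p0) <= alpha, or None if even c = 0 is too large."""
    c = None
    for k in range(n + 1):
        if binom_cdf(k, n, p0) <= alpha:
            c = k        # the region {0, ..., k} still has level <= alpha
        else:
            break        # the cdf is nondecreasing in k, so stop at the first failure
    return c


if __name__ == "__main__":
    n, p0, alpha, p_alt = 20, 0.9, 0.05, 0.8
    c = lower_tailed_critical_value(n, p0, alpha)
    actual_level = binom_cdf(c, n, p0)     # attained type I error probability
    beta = 1 - binom_cdf(c, n, p_alt)      # P(X > c when p = p_alt)
    print(f"reject H0 if x <= {c}; level = {actual_level:.3f}; beta({p_alt}) = {beta:.3f}")
```

The attained level is generally below $\alpha$ because of discreteness, which is why the code reports it separately from the nominal value.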


Example 9.17 A plastics manufacturer has developed a new type of plastic trash can and proposes
to sell them with an unconditional 6-year warranty. To see whether this is economically feasible, 20
prototype cans are subjected to an accelerated life test to simulate 6 years of use. The proposed
warranty will be modified only if the sample data strongly suggests that fewer than 90% of such cans
would survive the 6-year period. Let p denote the proportion of all cans that would survive the
accelerated test. The relevant hypotheses are then $H_0$: $p = .9$ versus $H_a$: $p < .9$. A decision will be
based on the test statistic X, the number among the 20 that survive. If the desired significance level is
$\alpha = .05$, then c must satisfy $B(c; 20, .9) \le .05$. From Appendix Table A.1, B(15; 20, .9) = .043 and
B(16; 20, .9) = .133. The appropriate rejection region is therefore $x \le 15$. If the accelerated test
results in x = 14, $H_0$ would be rejected in favor of $H_a$, necessitating a modification of the proposed
warranty. The probability of a type II error for the alternative value $p' = .8$ is

$$
\begin{aligned}
\beta(.8) &= P(H_0 \text{ is not rejected when } X \sim \text{Bin}(20, .8)) \\
&= P(X \ge 16 \text{ when } X \sim \text{Bin}(20, .8)) \\
&= 1 - B(15; 20, .8) = 1 - .370 = .630
\end{aligned}
$$

That is, when p = .8, 63% of all samples consisting of n = 20 cans would result in $H_0$ being
incorrectly not rejected; the power of this test procedure is just 37%. This error probability is high
because 20 is a small sample size and $p' = .8$ is close to the null value $p_0 = .9$. ■


Exercises: Section 9.3 (39-48)


39. State DMV records indicate that of all
vehicles undergoing emissions testing dur- 
ing the previous year, 70% passed on the 
first try. A random sample of 200 cars 
tested in a particular county during the 
current year yields 124 that passed on the 
initial test. Does this suggest that the true 
proportion for this county during the cur- 
rent year differs from the previous state- 
wide proportion? Test the relevant 
hypotheses using $\alpha = .05$.


40. Natural cork in wine bottles is subject to
deterioration, and as a result wine in such 
bottles may experience contamination. The 
article “Effects of Bottle Closure Type on 
Consumer Perceptions of Wine Quality” 
(Amer. J. Enology Viticulture 2007: 182- 
191) reported that in a tasting of commercial 
chardonnays, 16 of 91 bottles were consid- 
ered spoiled to some extent by cork- 
associated characteristics. Does this data 
provide strong evidence for concluding that 
more than 15% of all such bottles are con- 
taminated in this way? Carry out a test of 
hypotheses using a significance level of .10. 


41. A manufacturer of nickel-hydrogen batteries
randomly selects 100 nickel plates for
test cells, cycles them a specified number of
times, and determines that 14 of the plates
have blistered.

a. Does this provide compelling evidence 
for concluding that more than 10% of all 
plates blister under such circumstances? 
State and test the appropriate hypothe- 
ses using a significance level of .05. In 
reaching your conclusion, what type of 
error might you have committed? 

b. If it is really the case that 15% of all 
plates blister under these circumstances 
and a sample size of 100 is used, how 
likely is it that the null hypothesis of 
part (a) will not be rejected by the level 
.05 test? Answer this question for a 
sample size of 200. 


c. How many plates would have to be
tested to have $\beta(.15) = .10$ for the test
of part (a)?

42. A random sample of 150 recent donations
at a blood bank reveals that 82 were type A 
blood. Does this suggest that the actual 
percentage of type A donations differs from 
40%, the percentage of the population 
having type A blood? Carry out a test of the 
appropriate hypotheses using a significance 
level of .01. Would your conclusion have 
been different if a significance level of .05 
had been used? 
43. A university library ordinarily has a complete
shelf inventory done once every year.
Because of new shelving rules instituted the 
previous year, the head librarian believes it 
may be possible to save money by post- 
poning the inventory. The librarian decides 
to select at random 1000 books from the 
library’s collection and have them searched 
in a preliminary manner. If evidence indi- 
cates strongly that the true proportion of 
misshelved or unlocatable books is <.02, 
then the inventory will be postponed. 


a. Among the 1000 books searched, 15 
were misshelved or unlocatable. Test 
the relevant hypotheses and advise the 
librarian what to do (use $\alpha = .05$).

b. If the true proportion of misshelved and 
lost books is actually .01, what is the 
probability that the inventory will be 
(unnecessarily) taken? 

c. If the true proportion is .05, what is the 
probability that the inventory will be 
postponed? 

44. The authors of the article “Luck of the
Draw: Creating Chinese Brand Names”
(J. of Advertising Res. 2008: 523-530)
counted the number of “strokes” in the
characters for the names of 1202 Chinese
brand names. Certain totals for the number
of strokes are considered lucky in Chinese
culture, and the researchers hypothesized
that a majority of Chinese brand names
would have a “lucky” number of strokes.
Among the 1202 names sampled, 715 had a
“lucky” number of strokes. Test the
researchers’ hypothesis at the $\alpha = .01$
significance level.


45. A plan for an executive traveler’s club has
been developed by an airline on the premise 
that 5% of its current customers would 
qualify for membership. A random sample 
of 500 customers yielded 40 who would 
qualify. 


a. Using this data, test at level .01 the null 
hypothesis that the company’s premise 
is correct against the alternative that it is 
not correct. 

b. What is the probability that when the test 
of part (a) is used, the company’s pre- 
mise will be judged correct when in fact 
10% of all current customers qualify? 


46. Each of a group of 20 intermediate tennis
players is given two rackets, one having 
nylon strings and the other synthetic gut 
strings. After several weeks of playing with 
the two rackets, each player will be asked to 
state a preference for one of the two types 
of strings. Let p denote the proportion of all 
such players who would prefer gut to 
nylon, and let X be the number of players in 
the sample who prefer gut. Because gut 
strings are more expensive, consider the 
null hypothesis that at most 50% of all such 
players prefer gut. We simplify this to $H_0$:
$p = .5$, planning to reject $H_0$ only if sample
evidence strongly favors gut strings.


a. Which of the rejection regions { 15, 16, 17, 
18, 19, 20}, {0, 1,2, 3,4, 5}, or {0, 1, 2, 3, 
17, 18, 19, 20} is most appropriate, and 
why are the other two not appropriate? 

b. What is the probability of a type I error 
for the chosen region of part (a)? Does 
the region specify a level .05 test? Is it 
the best level .05 test? 

c. If 60% of all enthusiasts prefer gut, 
calculate the probability of a type II 
error using the appropriate region from 
part (a). Repeat if 80% of all enthusiasts 
prefer gut. 


d. If 13 out of the 20 players prefer gut,
should $H_0$ be rejected using a significance
level of .10?


47. A manufacturer of plumbing fixtures has
developed a new type of washerless faucet. 
Let p = P(a randomly selected faucet of 
this type will develop a leak within 2 years 
under normal use). The manufacturer has 
decided to proceed with production unless 
it can be determined that p is too large; the 
borderline acceptable value of p is specified 
as .10. The manufacturer decides to subject 
n of these faucets to accelerated testing 
(approximating 2 years of normal use). 
With X = the number among the n faucets 
that leak before the test concludes, pro- 
duction will commence unless the observed 
X is too large. It is decided that if p = .10, 
the probability of not proceeding should be 
at most .10, whereas if p = .30 the proba- 
bility of proceeding should be at most .10. 
Can n = 10 be used? n = 20? n = 25?
What is the appropriate rejection region for 
the chosen n, and what are the actual error 
probabilities when this region is used? 


48. Scientists have recently become concerned
about the safety of Teflon cookware and
various food containers because
perfluorooctanoic acid (PFOA) is used in
the manufacturing process. An article in the
July 27, 2005, New York Times reported
that of 600 children tested, 96% had PFOA
in their blood. According to the FDA, 90%
of all Americans have PFOA in their blood.

a. Does the data on PFOA incidence 
among children suggest that the per- 
centage of all children who have PFOA 
in their blood exceeds the FDA per- 
centage for all Americans? Carry out an 
appropriate test of hypotheses. 

b. If 95% of all children have PFOA in 
their blood, how likely is it that the null 
hypothesis tested in (a) will be rejected 
when a significance level of .01 is 
employed? 

c. Referring back to (b), what sample size 
would be necessary for the relevant 
probability to be .10?


9.4 P-Values


Using the rejection region method to test hypotheses entails first selecting a significance level $\alpha$. Then
after computing the value of the test statistic, the null hypothesis $H_0$ is rejected if the value falls in the
rejection region and is otherwise not rejected. We now consider another way of reaching a conclusion
in a hypothesis-testing analysis. This alternative approach is based on calculation of a certain
probability called a P-value. One advantage is that the P-value provides an intuitive measure of the
strength of evidence in the data against $H_0$.


DEFINITION The P-value is the probability, calculated assuming that the null hypothesis is true,
of obtaining a value of the test statistic at least as contradictory to $H_0$ as the value
calculated from the available sample.


The definition is quite a mouthful! Here are some key points:


• The P-value is a probability.

• This probability is calculated assuming that $H_0$ is true.

• The P-value is a function of the sample data.

• To determine the P-value, we must decide which values of the test statistic are “at least as
contradictory to $H_0$” as the value obtained from our sample.


Example 9.18 Urban storm water can be contaminated by many sources, including discarded 
batteries. When ruptured, these batteries release metals of environmental significance. The paper 
“Urban Battery Litter” (J. Environ. Engr. 2009: 46-57) presented summary data for characteristics of 
a variety of batteries found in urban areas around Cleveland. A sample of 51 Panasonic AAA batteries 
gave a sample mean zinc mass of 2.06 g and a sample standard deviation of .141 g. Does this data 
provide compelling evidence for concluding that the population mean zinc mass exceeds 2.0 g? 

With $\mu$ denoting the true average zinc mass (g) for such batteries, the relevant hypotheses are
$H_0$: $\mu = 2.0$ versus $H_a$: $\mu > 2.0$. The sample size is large enough so that the one-sample t test can be
used without making any specific assumption about the shape of the population distribution. The test
statistic value is

$$t = \frac{\bar{x} - 2.0}{s/\sqrt{n}} = \frac{2.06 - 2.0}{.141/\sqrt{51}} = 3.04$$


Now we must decide which values of t are “at least as contradictory to $H_0$.” Let’s first consider an
easier task: Which values of $\bar{x}$ are at least as contradictory to the null hypothesis as 2.06 g, the mean
of the observations in our sample? Because > appears in $H_a$, it should be clear that 2.10 g is at least as
contradictory to $H_0$ as is 2.06, so is 2.25, and so in fact is any $\bar{x}$ value that exceeds 2.06. An $\bar{x}$ value
that exceeds 2.06 g corresponds to a value of t that exceeds 3.04. Thus the P-value is

$$\text{P-value} = P(T \ge 3.04 \text{ when } \mu = 2.0)$$


Since the test statistic T was created by subtracting the null value 2.0 in the numerator, when $\mu = 2.0$
(i.e., when $H_0$ is true) T has approximately a t distribution with 51 − 1 = 50 df. As a result,

$$
\begin{aligned}
\text{P-value} &= P(T \ge 3.04 \text{ when } \mu = 2.0) \\
&\approx \text{area under the } t_{50} \text{ curve to the right of } 3.04 \\
&\approx .0019
\end{aligned}
$$

The area under the t curve was determined using software. ■
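
The text does not say which software was used for this area. As one possible sketch, the upper-tail area of a t curve can be approximated with the Python standard library alone by integrating the t density numerically. Simpson's rule, the step count, and the function names are all choices made for this illustration, not the book's method; a statistics package would normally supply the t cdf directly.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x: float, df: float) -> float:
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


def t_upper_tail(t: float, df: float, steps: int = 2000) -> float:
    """Approximate P(T > t): Simpson's rule on [0, |t|] combined with symmetry about 0."""
    if t < 0:
        return 1 - t_upper_tail(-t, df, steps)
    if steps % 2:
        steps += 1                      # Simpson's rule needs an even number of intervals
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3          # area right of 0 is .5; subtract the area on [0, t]


if __name__ == "__main__":
    # Summary data from Example 9.18.
    n, xbar, s, mu0 = 51, 2.06, 0.141, 2.0
    t_stat = (xbar - mu0) / (s / sqrt(n))
    print(f"t = {t_stat:.2f}, P-value = {t_upper_tail(t_stat, n - 1):.4f}")
```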


We will shortly illustrate how to determine the P-value for any z or t test; that is, any test where the
reference distribution is the standard normal or some t distribution. For the moment, though, let’s
focus on reaching a conclusion once the P-value is available. Because it is a probability, the P-value
must be between 0 and 1. What kinds of P-values provide evidence against the null hypothesis?
Consider two specific instances:


• P-value = .250: In this case, fully 25% of all possible test statistic values are at least as contradictory to
$H_0$ as the one that came out of our sample. So our data is not all that contradictory to the null
hypothesis: even if $H_0$ is true, we’d see data at least as extreme as ours one-quarter of the time.

• P-value = .0019: Here, only .19% of all possible test statistic values are at least as contradictory to
$H_0$ as what we obtained. Thus the sample appears to be highly contradictory to the null hypothesis.


More generally, the smaller the P-value, the more evidence there is in the sample data against the
null hypothesis and for the alternative hypothesis. That is, $H_0$ should be rejected in favor of $H_a$ when
the P-value is sufficiently small. So what constitutes “sufficiently small”?

Whatever rule we use, it should not result in decisions that contradict the rejection region procedures
we have seen previously. Consider, for instance, an upper-tailed z test at the $\alpha = .01$ level, for
which the z critical value is $z_{.01} = 2.33$. Using precisely the logic of the previous example, the P-value
of the hypothesis test should be the area under the standard normal curve to the right of the observed
test statistic value z. Two such possible P-values are illustrated in Figure 9.7. But the rejection region
already prescribes that we should reject $H_0$ if $z \ge 2.33$ and fail to reject $H_0$ if $z < 2.33$. Figure 9.7a
shows that for any z value in the rejection region, the resulting P-value will be ≤ .01; conversely, as
seen in Figure 9.7b, the P-value will be > .01 precisely when $z < 2.33$, instructing us to not reject $H_0$.


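[Figure 9.7: two z curves with the critical value 2.33 marked and the area to the right of the calculated z shaded as the P-value]

Figure 9.7 P-values for an upper-tailed z test: (a) P-value ≤ .01 if $z \ge 2.33$ (reject $H_0$);
(b) P-value > .01 if $z < 2.33$ (do not reject $H_0$)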


The preceding illustration generalizes to other tests (lower- and two-tailed, t as well as z) and other 
significance levels, leading to the following decision rule. 




DECISION RULE BASED ON THE P-VALUE

Select a significance level $\alpha$ (as before, the desired type I error
probability). Then reject $H_0$ if P-value ≤ $\alpha$; do not reject $H_0$ if
P-value > $\alpha$.


Figure 9.8 provides an easy way to visualize the decision rule. The calculation of the P-value depends 
on whether the test is upper-, lower-, or two-tailed. However, once it has been calculated, the 
comparison with « does not depend on which type of test was used. 


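[Figure 9.8: number line from 0 to 1 with $\alpha$ marked; P-values at or below $\alpha$ fall in the "Reject $H_0$" portion, and P-values above $\alpha$ fall in the "Fail to Reject $H_0$" portion]

Figure 9.8 Comparing $\alpha$ and the P-value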


In Example 9.18, we calculated P-value = .0019. Then using a significance level of .01, we would
reject the null hypothesis in favor of the alternative hypothesis because .0019 ≤ .01. However,
suppose we had selected a significance level of .001, which requires more substantial evidence from
the data before $H_0$ can be rejected. In this case we would not reject $H_0$ because .0019 > .001. Note
that $\alpha$ should be specified before data is collected and the P-value calculated. It would be unethical to
compute the P-value first and then select a significance level that would guarantee the desired
outcome (e.g., deliberately choosing $\alpha$ greater than the P-value so that $H_0$ is rejected).


Example 9.19 The true average time to initial relief of pain for a best-selling pain reliever is known
to be 10 min. Let $\mu$ denote the true average time to relief for a company’s newly developed reliever.
Suppose that when data from an experiment involving the new pain reliever was analyzed, the
P-value for testing $H_0$: $\mu = 10$ versus $H_a$: $\mu < 10$ was calculated as .0384. Since the P-value is less
than $\alpha = .05$, $H_0$ would be rejected by anyone carrying out the test at level .05. However, at level .01,
$H_0$ would not be rejected because .0384 > .01. Again, $\alpha$ should be specified in advance of analyzing
the data. ■


The most widely used statistical computer packages automatically include a P-value when a 
hypothesis-testing analysis is performed. A conclusion can then be drawn directly from the output, 
without reference to a table of critical values. With the P-value in hand, an investigator can see at a 
quick glance whether $H_0$ should be rejected at the prescribed $\alpha$ level. In addition, knowing the P-value
allows a decision maker to distinguish between a close call (e.g., $\alpha = .05$, P-value = .0498) and a
very clear-cut conclusion (e.g., $\alpha = .05$, P-value = .0003), something that would not be possible just
from the statement “$H_0$ can be rejected at significance level .05.”


P-Values for z Tests 

The P-value for a z test (i.e., one based on a test statistic whose distribution when $H_0$ is true is at least
approximately standard normal) is easily determined from the information in Appendix Table A.3.
Consider an upper-tailed test and let z denote the computed value of the test statistic Z. As illustrated
in Figure 9.7, the P-value is just the area to the right of the computed value z under the standard
normal curve. The corresponding cumulative area is $\Phi(z)$, so in this case P-value = $1 - \Phi(z)$. An
analogous argument for a lower-tailed test shows that the P-value is the area captured by the
computed value z in the lower tail of the standard normal curve, $\Phi(z)$.

More care must be exercised in the case of a two-tailed test. Suppose first that z is positive. We
know to reject $H_0$ if and only if $z \ge z_{\alpha/2}$, which occurs precisely when $1 - \Phi(z) \le \alpha/2$, or
$2[1 - \Phi(z)] \le \alpha$. Comparing this to the earlier decision rule, we infer that the P-value is precisely the
quantity $2[1 - \Phi(z)]$. If z is negative, a similar argument leads to P-value = $2[1 - \Phi(-z)]$. Since
$-z = |z|$ when z is negative, the P-value = $2[1 - \Phi(|z|)]$ for either positive or negative z.


$$
\text{P-value} = \begin{cases}
1 - \Phi(z) & \text{for an upper-tailed test} \\
\Phi(z) & \text{for a lower-tailed test} \\
2[1 - \Phi(|z|)] & \text{for a two-tailed test}
\end{cases}
$$


Each of these is the probability of getting a value at least as extreme as what was obtained (assuming
$H_0$ true). The three cases are illustrated in Figure 9.9.


[Figure 9.9: three z curves, each showing the calculated z and the shaded P-value
1. Upper-tailed test ($H_a$ contains the inequality >): P-value = area in upper tail = $1 - \Phi(z)$
2. Lower-tailed test ($H_a$ contains the inequality <): P-value = area in lower tail = $\Phi(z)$
3. Two-tailed test ($H_a$ contains the inequality ≠): P-value = sum of area in two tails = $2[1 - \Phi(|z|)]$]

Figure 9.9 Determination of the P-value for a z test


The next example illustrates the use of the P-value approach to hypothesis testing by means of a 
sequence of steps modified from our previously recommended sequence. 


Example 9.20 A Gallup poll (reported July 15, 2019) found that 29% of 1018 U.S. adults support
statehood for the District of Columbia. Thirty years prior, 31% of U.S. adults held this opinion. Does
the 2019 sample provide convincing statistical evidence at the α = .10 level that the proportion of
U.S. adults supporting DC statehood changed over those thirty years?


1. Parameter of interest: p = proportion of all U.S. adults in 2019 who support statehood for the
District of Columbia

2. Null hypothesis: H0: p = .31 (no change since 1989)
Alternative hypothesis: Ha: p ≠ .31

3. Assuming H0 is true, np0 = 1018(.31) ≥ 10 and nq0 = 1018(1 − .31) ≥ 10. Thus, a
one-proportion z test may be applied.

4. Formula for test statistic value:

$$
z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{\hat{p} - .31}{\sqrt{(.31)(.69)/n}}
$$

5. Calculation of test statistic value:

$$
z = \frac{.29 - .31}{\sqrt{(.31)(.69)/1018}} = -1.38
$$

6. Determination of P-value: Because the test is two-tailed,
P-value = 2[1 − Φ(|−1.38|)] = .1676

7. Conclusion: Using a significance level of .10, H0 would not be rejected since .1676 > .10. At this
significance level, there is insufficient evidence to conclude that the proportion of U.S. adults who
support DC becoming a state changed over thirty years. ■
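The three z-test P-value formulas and the calculation in Example 9.20 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `z_test_p_value` and its `tail` argument are assumptions introduced here, and Φ is taken from the standard library's `statistics.NormalDist`. Because the code keeps the unrounded value of z, the P-value it prints may differ from .1676 in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(z, tail):
    """P-value for a z test; tail is 'upper', 'lower', or 'two' (hypothetical helper)."""
    phi = NormalDist().cdf  # standard normal cdf, Φ
    if tail == "upper":
        return 1 - phi(z)
    if tail == "lower":
        return phi(z)
    if tail == "two":
        return 2 * (1 - phi(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two'")


# Example 9.20: sample proportion .29, null value .31, n = 1018
p_hat, p0, n = 0.29, 0.31, 1018
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = z_test_p_value(z, "two")
print(round(z, 2), round(p_value, 4))
```

Comparing `p_value` with the chosen α (here .10) reproduces the decision in step 7.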


P-Values for t Tests 

Just as the P-value for a z test is a z curve area, the P-value for a t test will be a t curve area. Figure 9.9
illustrates the three possible cases: simply replace each z value or z curve with a t value or t curve. The
number of df for the one-sample t test is n − 1.

The table of t critical values used previously for confidence and prediction intervals doesn’t
contain enough information about any particular t distribution to allow for accurate determination of
desired areas, so we have included another t table in Appendix Table A.7, one that contains a
tabulation of upper-tail t curve areas. Each different column of the table is for a different number of df,
and the rows are for calculated values of the test statistic t ranging from 0.0 to 4.0 in increments of .1.
For example, the number .074 appears at the intersection of the 1.6 row and the 8 df column, so the
area under the 8 df curve to the right of 1.6 (an upper-tail area) is .074. Because t curves are
symmetric, .074 is also the area under the 8 df curve to the left of −1.6 (a lower-tail area).

Suppose, for example, that a test of H0: μ = 100 versus Ha: μ > 100 is based on the 8 df t distribution.
If the calculated value of the test statistic is t = 1.6, then the P-value for this upper-tailed test is .074.
Because .074 exceeds .05, we would not be able to reject H0 at a significance level of .05. If the
alternative hypothesis is Ha: μ < 100 and a test based on 20 df yields t = −3.2, then Appendix Table A.7
shows that the P-value is the captured lower-tail area .002. The null hypothesis can be rejected at either
level .05 or .01. Finally, for Ha: μ ≠ 100, if a t test is based on 20 df and t = 3.2, then the P-value for this
two-tailed test is 2(.002) = .004. This would also be the P-value for t = −3.2. The tail area is doubled
because values both larger than 3.2 and smaller than −3.2 are more contradictory to H0 than what was
calculated (values farther out in either tail of the t curve; see the bottom graph in Figure 9.9).


Example 9.21 The recommended daily intake of calcium for adults ages 18-30 is 1000 mg/day. 
The article “Dietary and Total Calcium Intakes Are Associated with Lower Percentage Total Body 
and Truncal Fat in Young, Healthy Adults” (J. Amer. College of Nutr. 2011: 484-490) reported the 
following summary data for a sample of 76 healthy Caucasian males from southwestern Ontario, 
Canada: n = 76, x̄ = 1093, s = 477. Let’s carry out a test at significance level .01 to see whether the
population mean daily intake exceeds the recommended value. 


1. μ = the mean daily calcium intake for this population (healthy Caucasian males from southwestern
Ontario)

2. H0: μ = 1000
Ha: μ > 1000

3. Since n = 76 > 40, the one-sample t test is valid here (even if the calcium intake distribution is not
normal).

4. Formula for test statistic value:

$$
t = \frac{\bar{x} - 1000}{s/\sqrt{n}}
$$

5. Calculation of test statistic value:

$$
t = \frac{1093 - 1000}{477/\sqrt{76}} = 1.70
$$

6. The P-value is the area under the t curve with 75 df to the right of 1.70 (the inequality in Ha implies that the
test is upper-tailed). From Table A.7, this area is between .047 (the upper-tail area at 60 df) and
.046 (the upper-tail area at 120 df). Software gives a P-value of .0467.

7. Because this P-value is larger than .01, H0 cannot be rejected. There is not compelling evidence to
conclude at significance level .01 that the population mean daily intake exceeds the recommended
value (even though the sample mean does so). Note that the opposite conclusion would result from
using a significance level of .05. But the smaller α that we used requires more persuasive evidence
from the data before rejecting H0. ■
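When a table such as Appendix Table A.7 is not at hand, an upper-tail t curve area can be approximated numerically. The sketch below is one way to do this using only Python's standard library: it integrates the t density with Simpson's rule. The names `t_pdf` and `t_upper_tail` and the choice of 2000 subintervals are assumptions made for illustration; statistical software typically uses more refined methods.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule on [0, |t|])."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3   # area between 0 and |t|
    upper = 0.5 - central     # area to the right of |t|, by symmetry
    return upper if t >= 0 else 1 - upper


# Table A.7 entry quoted in the text: 1.6 row, 8 df column
print(round(t_upper_tail(1.6, 8), 3))

# Example 9.21: upper-tailed one-sample t test with n = 76
t = (1093 - 1000) / (477 / sqrt(76))
print(round(t, 2), round(t_upper_tail(t, 75), 4))
```

For a lower-tailed test the P-value is `1 - t_upper_tail(t, df)`, and for a two-tailed test it is `2 * t_upper_tail(abs(t), df)`.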


More on Interpreting P-Values 

The P-value resulting from carrying out a test on a selected sample is not the probability that H0 is
true, nor is it the probability of rejecting the null hypothesis. Once again, it is the probability,
calculated assuming that H0 is true, of obtaining a test statistic value at least as contradictory to the
null hypothesis as the value that actually resulted. For example, consider testing H0: μ = 50 against
Ha: μ < 50 using a lower-tailed z test. If the calculated value of the test statistic is z = −2.00, then


P-value = Φ(z) = Φ(−2.00) = .0228


But if a second sample is selected, the resulting value of z will almost surely be different from —2.00, 
so the corresponding P-value will also likely differ from .0228. Because the test statistic value itself 
varies from one sample to another, the P-value will also vary from one sample to another. That is, the 
test statistic is a random variable, and so the P-value will also be a random variable. A first sample 
may give a P-value of .0228, a second sample result in a P-value of .1175, a third yield .0606 as the 
P-value, and so on. 

If H0 is false, we hope the P-value will be close to 0 so that the null hypothesis can be rejected. On
the other hand, when H0 is true, we’d like the P-value to exceed the selected significance level so that
the correct decision to not reject H0 is made. The next example presents simulations to show how the
P-value behaves both when the null hypothesis is true and when it is false.


Example 9.22 The fuel efficiency (mpg) of any particular new vehicle under specified driving
conditions may not be identical to the EPA figure that appears on the vehicle’s sticker. Suppose that
four different vehicles of a particular type are to be selected and driven over a certain course, after
which the fuel efficiency of each one is to be determined. Let μ denote the true average fuel efficiency
under these conditions.

Consider testing H0: μ = 30 versus Ha: μ > 30 using the one-sample t test based on the resulting
sample. Since the test is based on n − 1 = 3 degrees of freedom, the P-value for an upper-tailed test is
the area under the t curve with 3 df to the right of the calculated t.

Let’s first suppose that H0 is true. We used software to generate 10,000 different samples, each
containing 4 observations, from a normal population distribution with mean value μ = 30 and
standard deviation σ = 2. The first sample and resulting summary quantities were

$$
x_1 = 30.830,\; x_2 = 32.232,\; x_3 = 30.276,\; x_4 = 27.718 \;\Rightarrow\;
\bar{x} = 30.264,\; s = 1.8864,\; t = \frac{30.264 - 30}{1.8864/\sqrt{4}} = .2799
$$


The P-value is the area under the t curve with 3 df to the right of .2799, which according to software is .3989.
Using a significance level of .05, the null hypothesis would of course not be rejected. The values of
t for the next four samples were −1.7591, .6082, −.7020, and 3.1053, with corresponding P-values
.912, .293, .733, and .0265.

Figure 9.10a shows a histogram of the 10,000 P-values from this simulation experiment.
About 4.5% of these P-values are in the first class interval from 0 to .05. Thus when using a
significance level of .05, the null hypothesis is rejected in roughly 4.5% of these 10,000 tests. If we
continue to generate samples and carry out the test for each one at significance level .05, in the long
run 5% of the P-values will be in the first class interval, because when H0 is true and a test with
significance level .05 is used, by definition the probability of rejecting H0 (i.e., of committing a type I
error) is .05.

Looking at the histogram, it appears that the distribution of P-values is relatively flat. In fact, it can
be shown that when H0 is true, the probability distribution of the P-value is a Uniform [0, 1]
distribution. Since P(U ≤ .05) = .05 for a Uniform [0, 1] rv, we again have that the probability of
rejecting H0 when it is true is .05, the chosen significance level.

Now consider what happens when H0 is false because μ = 31. We again generated 10,000 different
samples of size 4, but now each from a normal distribution with μ = 31 and σ = 2. The t statistic and
P-value were calculated as before for each sample, and Figure 9.10b gives a histogram of the 10,000
resulting P-values. The shape of this histogram is quite different from that of Figure 9.10a: there is a
much greater tendency for the P-value to be small (closer to 0) when μ = 31 than when μ = 30. Again
H0 is rejected at significance level .05 whenever the P-value is at most .05 (in the first class interval).
Unfortunately this is the case for only about 19% of the 10,000 P-values. So only about 19% of the
10,000 tests correctly reject the null hypothesis (an estimate of the test’s power); for the other 81%, a
type II error is committed. The difficulty is that the sample size is extremely small and 31 is not very
different from the value asserted by the null hypothesis.

Figure 9.10c illustrates what happens to the P-value when H0 is false because μ = 32 (still with
n = 4 and σ = 2). The histogram is even more concentrated toward values close to 0 than was the
case when μ = 31. In general, as μ moves further to the right of the null value 30, the distribution of
the P-values will become more and more concentrated on values close to 0. Even here, a bit fewer
than 50% of the 10,000 P-values are smaller than .05. So it is still slightly more likely than not that
the null hypothesis is incorrectly not rejected, principally because n is so small.

The big idea of this example is that because the value of any test statistic is random, the P-value
will also be a random variable and thus have a distribution. The farther the actual value of the
parameter is from the value specified by the null hypothesis, the more the distribution of the P-value
will be concentrated on values close to 0 and the greater the chance that the test will correctly reject
H0 (corresponding to smaller β).
[Figure 9.10 shows three histograms of the 10,000 simulated P-values (horizontal axis: P-value;
vertical axis: percent). Panel a: H0: μ = 30 is true. Panel b: H0 is false because μ = 31.
Panel c: H0 is false because μ = 32.]

Figure 9.10 P-value simulation results for Example 9.22
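The simulation in Example 9.22 can be imitated with a short script. The sketch below is an assumption-laden stand-in for whatever software was actually used: the function name `rejection_rate`, the seed, and the use of Python's `random` module are illustrative choices. Instead of computing each P-value, it counts how often t ≥ 2.353, the tabled t critical value for tail area .05 with 3 df; for this upper-tailed test that event is the same as P-value ≤ .05.

```python
import random
from math import sqrt

T_CRIT = 2.353  # t critical value for upper-tail area .05 with 3 df (from a t table)


def rejection_rate(true_mean, reps=10_000, n=4, sigma=2.0, null_mean=30.0, seed=1):
    """Fraction of simulated samples for which H0: mu = null_mean is rejected at level .05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - null_mean) / (s / sqrt(n))
        if t >= T_CRIT:  # equivalent to P-value <= .05
            rejections += 1
    return rejections / reps


for mu in (30, 31, 32):
    print(mu, rejection_rate(mu))
```

With μ = 30 the printed fraction estimates the type I error probability; with μ = 31 or 32 it estimates the power of the test. The exact fractions depend on the seed, so they should be close to, but not identical with, the percentages reported in the example.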
Exercises: Section 9.4 (49-63)


49. For which of the given P-values would the null hypothesis be rejected when performing a level .05 test?

a. .001 b. .021 c. .078 d. .047 e. .148


50. Pairs of P-values and significance levels, α, are given. For each pair, state whether the
observed P-value would lead to rejection of H0 at the given significance level.

a. P-value = .084, α = .05
b. P-value = .003, α = .001
c. P-value = .498, α = .05
d. P-value = .084, α = .10
e. P-value = .039, α = .01
f. P-value = .218, α = .10


51. Let μ denote the mean reaction time to a certain stimulus. For a one-sample z test of
H0: μ = 5 versus Ha: μ > 5 (i.e., assuming σ is known), find the P-value associated with each of the
given values of the z test statistic.

a. 1.42 b. .90 c. 1.96 d. 2.48 e. −.11


52. Newly purchased tires of a certain type are supposed to be filled to a pressure of 30 lb/in².
Let μ denote the true average pressure. Find the P-value associated with each given one-sample
z statistic value for testing H0: μ = 30 versus Ha: μ ≠ 30.

a. 2.10 b. −1.75 c. −.55 d. 1.41 e. −5.3


53. Give as much information as you can about the P-value of a t test in each of the following situations:

a. Upper-tailed test, df = 8, t = 2.0
b. Lower-tailed test, df = 11, t = −2.4
c. Two-tailed test, df = 15, t = −1.6
d. Upper-tailed test, df = 19, t = −.4
e. Upper-tailed test, df = 5, t = 5.0
f. Two-tailed test, df = 40, t = −4.8


54. The paint used to make lines on roads must reflect enough light to be clearly visible at
night. Let μ denote the true average reflectometer reading for a new type of paint under
consideration. A test of H0: μ = 20 versus Ha: μ > 20 will be based on a random sample of size n from
a normal population distribution. What conclusion is appropriate in each of the following situations?

a. n = 15, t = 3.2, α = .05
b. n = 9, t = 1.8, α = .01
c. n = 24, t = −.2


55. Let μ denote true average serum receptor concentration for all pregnant women. The
average for all women is known to be 5.63. The article “Serum Transferrin Receptor for
the Detection of Iron Deficiency in Pregnancy” (Amer. J. Clin. Nutrit. 1991: 1077-1081) reports
that P-value > .10 for a test of H0: μ = 5.63 versus Ha: μ ≠ 5.63 based on n = 176 pregnant women.
Using a significance level of .01, what would you conclude?


56. An aspirin manufacturer fills bottles by weight rather than by count. Since each
bottle should contain 100 tablets, the average weight per tablet should be 5 grains.
Each of 100 tablets taken from a very large lot is weighed, resulting in a sample average
weight per tablet of 4.87 grains and a sample standard deviation of .35 grain.
Does this information provide strong evidence for concluding that the company is
not filling its bottles as advertised? Test the appropriate hypotheses using α = .01 by
first computing the P-value and then comparing it to the specified significance level.


57. Because of variability in the manufacturing process, the actual yielding point of a
sample of mild steel subjected to increasing stress will usually differ from the theoretical
yielding point. Let p denote the true proportion of samples that yield before their
theoretical yielding point. If on the basis of a sample it can be concluded that more than
20% of all specimens yield before the theoretical point, the production process will
have to be modified.

a. If 15 of 60 specimens yield before the theoretical point, what is the P-value
when the appropriate test is used, and what would you advise the company to do?

b. If the true percentage of “early yields” is actually 50% (so that the theoretical
point is the median of the yield distribution) and a level .01 test is used, what
is the probability that the company concludes a modification of the process
is necessary?


58. Standard-size boxes for a particular brand of cereal indicate a net weight of 14 oz.
A consumer group purchases a random sample of 50 such cereal boxes and weighs
their contents. If the average of these 50 weights is 13.8 oz with a standard deviation
of 1.1 oz, does the consumer group have sufficient evidence to conclude that
the cereal company is under-filling its packages? Test at the α = .05 level using
the P-value method.


59. A random sample of soil specimens was obtained, and the amount of organic matter
(%) in the soil was determined for each specimen, resulting in the accompanying
data (from “Engineering Properties of Soil,” Soil Sci. 1998: 93-102).

1.10  5.09  0.97  1.59  4.60  0.32  0.55  1.45
0.14  4.47  1.20  3.50  5.02  4.67  5.22  2.69
3.98  3.17  3.03  2.21  0.69  4.47  3.31  1.17
0.76  1.17  1.57  2.62  1.66  2.05

The values of the sample mean and sample standard deviation are 2.481 and 1.616,
respectively. Does this data suggest that the true average percentage of organic matter in
such soil is something other than 3%? Carry out a test of the appropriate
hypotheses at significance level .10 by first determining the P-value. Would your
conclusion be different if α = .05 had been used? [Note: A normal probability plot of
the data shows an acceptable pattern in light of the reasonably large sample size.]


60. Repeat the analysis of Exercise 40 using the P-value method. Do you arrive at the same
conclusion?


61. A pen has been designed so that true average writing lifetime under controlled
conditions (involving the use of a writing machine) is at least 10 h. A random sample
of 18 pens is selected, the writing lifetime of each is determined, and a normal
probability plot of the resulting data supports the use of a one-sample t test.

a. What hypotheses should be tested if the investigators believe a priori that the
design specification has been satisfied?

b. What conclusion is appropriate if the hypotheses of part (a) are tested,
t = −2.3, and α = .05?

c. What conclusion is appropriate if the hypotheses of part (a) are tested,
t = −1.8, and α = .01?

d. What should be concluded if the hypotheses of part (a) are tested and
t = −3.6?


62. A spectrophotometer used for measuring CO concentration [ppm (parts per million)
by volume] is checked for accuracy by taking readings on a manufactured gas
(called span gas) in which the CO concentration is very precisely controlled at
70 ppm. If the readings suggest that the spectrophotometer is not working properly,
it will have to be recalibrated. Assume that if it is properly calibrated, measured
concentration for span gas samples is normally distributed. On the basis of the six readings
85, 77, 82, 68, 72, and 69, is recalibration necessary? Carry out a test of the
relevant hypotheses using the P-value approach with α = .05.
63. The relative conductivity of a semiconductor device is determined by the amount
of impurity “doped” into the device during its manufacture. A silicon diode to be used
for a specific purpose requires an average cut-on voltage of .60 V, and if this is not
achieved, the amount of impurity must be adjusted. A sample of diodes was selected
and the cut-on voltage was determined. The accompanying SAS output resulted
from a request to test the appropriate hypotheses.

N     Mean         Std Dev      T            Prob>|T|
15    0.6453333    0.0899100    1.9527887    0.0711

[Note: By default, SAS’s P-value is for a two-tailed test.] What would be concluded
for a significance level of .01? .05? .10?


9.5 The Neyman-Pearson Lemma and Likelihood Ratio Tests 


The test procedures presented thus far are (hopefully) intuitively reasonable, but have not been shown 
to be “best” in any sense. How can an optimal test be obtained, one for which the type II error 
probability is as small as possible, subject to controlling the type I error probability at the desired 
level? 


Simple Hypotheses 

Our starting point here will be a rather unrealistic situation from a practical viewpoint: testing a simple
null hypothesis against a simple alternative hypothesis. A simple hypothesis is one which, when true,
completely specifies the distribution of the sample Xi’s. Suppose, for example, that X1, ..., Xn form a
random sample from an exponential distribution with parameter λ. Then the hypothesis H: λ = 5 is
simple, since when H is true each Xi has an exponential distribution with parameter λ = 5. We might
then consider H0: λ = 5 versus Ha: λ = 10, both of which are simple hypotheses. The hypothesis
Ha: λ < 5 is not simple, because when Ha is true, the distribution of each Xi might be exponential with
λ = 4 or with λ = 2.8 or ....

Similarly, if the Xi’s constitute a random sample from a normal distribution with known σ, then
H: μ = 100 is a simple hypothesis. But if the value of σ is unknown, this hypothesis is not simple
because the distribution of each Xi is not completely specified; it could be N(100, 15) or N(100, 12) or
N(100, σ) for any other positive value of σ. For a hypothesis to be simple, the value of every
parameter in the pmf or pdf of the Xi’s must be specified.

Throughout this chapter we have always employed composite (that is, not simple) alternative
hypotheses. In practice, a pair of simple hypotheses such as H0: λ = 5 versus Ha: λ = 10 is almost
never tested, since they imply that no other λ value is possible (what if both are false because
λ = 7.6?). However, when hypothesis testing was developed about a century ago, early statistical
pioneers developed optimal methods for a pair of simple hypotheses and then built up from that
foundation.


The Neyman-Pearson Lemma 

The next result was a milestone in the theory of hypothesis testing: a method for constructing a best
test for a pair of simple hypotheses. Let f(x1, ..., xn; θ) be the joint pmf or pdf of the Xi’s. Our simple
null hypothesis will assert that θ = θ0 and the simple alternative hypothesis will claim that θ = θa.
The result carries over to the case of more than one parameter as long as the value of each parameter
is completely specified in both H0 and Ha.
THE NEYMAN-PEARSON LEMMA

For testing a simple null hypothesis H0: θ = θ0 versus a simple alternative
hypothesis Ha: θ = θa, let k be a fixed positive number and form the
rejection region

$$
R^* = \left\{ (x_1, \ldots, x_n) : \frac{f(x_1, \ldots, x_n; \theta_a)}{f(x_1, \ldots, x_n; \theta_0)} \ge k \right\} \qquad (9.6)
$$

Let α* = P((X1, ..., Xn) ∈ R* when θ = θ0), the probability of a type I error
using R*, and let β* denote the type II error probability (i.e., the probability
that the Xi’s lie in the complement of R* when θ = θa).

Then for any other test procedure with type I error probability α satisfying
α ≤ α*, the probability of a type II error must satisfy β ≥ β*. That is, the
test with rejection region R* has the smallest type II error probability among
all tests for which the type I error probability is at most α*.


The test statistic value in (9.6) is called a likelihood ratio: it’s the ratio of the alternative likelihood to
the null likelihood. We’ll explore likelihood ratio tests more deeply later in this section. As in
previous sections of this chapter, the constant k in the rejection region is tied to the type I error
probability α*. In the continuous case, k can be selected to give one of the traditional significance
levels (.05, .01, and so on), whereas in the discrete case α* = .057 or .039 may be as close as one can
get to .05.

Roughly speaking, the Neyman-Pearson Lemma prescribes, subject to a given significance level,
the test procedure that minimizes the chance of committing a type II error. Equivalently, it maximizes
the power of the hypothesis test; that is, R* in (9.6) defines the most powerful test of the simple
hypotheses H0: θ = θ0 versus Ha: θ = θa at its level of significance.


Example 9.23 As part of quality control at a semiconductor plant, consider randomly selecting n = 5
newly-made integrated circuits of a certain type and determining the number of defects on each one.
Let Xi denote the number of such defects for the ith selected circuit (i = 1, ..., 5), and suppose that the
Xi’s form a random sample from a Poisson distribution with parameter μ. Let’s find the best test for
testing H0: μ = 1 versus Ha: μ = 2. The Poisson likelihood is

$$
f(x_1, \ldots, x_5; \mu) = \frac{e^{-5\mu} \mu^{\sum x_i}}{\prod x_i!}
$$

Substituting first μ = 2, then μ = 1, and then taking the ratio of these two likelihoods as in (9.6) gives
the rejection region

$$
R^* = \left\{ (x_1, \ldots, x_5) : e^{-5}\, 2^{\sum x_i} \ge k \right\}
$$

Multiplying both sides of the inequality by e^5 and taking a logarithm allows us to rewrite the
rejection region as Σxi ≥ c, where c = ln(k e^5)/ln(2).

This latter rejection region is completely equivalent to R*: for any particular value k there will be a
corresponding value c, and vice versa. But it is much easier to express the rejection region in this
latter form and then select c to obtain a desired significance level than it is to determine an appropriate
value of k for the likelihood ratio. In particular, the rv Y = ΣXi has a Poisson distribution with
parameter 5μ (via a moment generating function argument), so when H0 is true Y ~ Poisson(5). If
we use c = 10 in the rejection region, then from Table A.2

α* = P(Y ≥ 10 when Y ~ Poisson(5)) = 1 − .968 = .032

Choosing instead c = 9 gives α* = .068. If we insist that the significance level be at most .05, then the
optimal rejection region is R* = {(x1, ..., x5) : Σxi ≥ 10}, and α* = P(type I error) = .032.
When Ha is true, the test statistic Y has a Poisson distribution with parameter 5(2) = 10. Thus

β* = P(H0 is not rejected when Ha is true)
   = P(Y < 10 when Y ~ Poisson(10)) = .458


The Neyman-Pearson Lemma guarantees that any other test procedure based on these 5 observations,
provided its type I error probability is ≤ .032, must necessarily have a type II error probability
greater than or equal to .458. Equivalently, every test in this situation with α ≤ .032 has power no
better than 1 − .458 = .542; to increase power here would require increasing α*.

Obviously the type II error probability here is quite large (and the power rather low). This is
because the sample size n = 5 is too small to allow for effective discrimination between μ = 1 and
μ = 2. For a sample size of 10, the best test having significance level at most .05 uses c = 16, for
which α* = .049 (Poisson parameter = 10) and β* = .157 (Poisson parameter = 20).

Finally, returning to a sample size of n = 5, c = 10 implies that 10 = ln(k e^5)/ln(2), from which
k = 2^10/e^5 ≈ 6.9. For the best test to have a significance level of at most .05, the null hypothesis
should be rejected only when the likelihood for the alternative value of μ is more than about 7 times
what it is for the null value. ■
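The error probabilities quoted in Example 9.23 come from a Poisson table, but they can also be computed directly. The following sketch uses only Python's standard library; the helper name `poisson_cdf` is an assumption introduced for illustration.

```python
from math import exp, factorial


def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu)."""
    return sum(exp(-mu) * mu**j / factorial(j) for j in range(y + 1))


# n = 5, reject H0: mu = 1 in favor of Ha: mu = 2 when the sum of the x's is >= c = 10
c = 10
alpha_star = 1 - poisson_cdf(c - 1, 5 * 1)   # P(Y >= 10) when Y ~ Poisson(5)
beta_star = poisson_cdf(c - 1, 5 * 2)        # P(Y <= 9) when Y ~ Poisson(10)
k = 2**c / exp(5)                            # solves c = ln(k e^5)/ln(2) for k

# n = 10 with c = 16
alpha_10 = 1 - poisson_cdf(15, 10 * 1)       # P(Y >= 16) when Y ~ Poisson(10)
beta_10 = poisson_cdf(15, 10 * 2)            # P(Y <= 15) when Y ~ Poisson(20)

print(round(alpha_star, 3), round(beta_star, 3), round(k, 1))
print(round(alpha_10, 3), round(beta_10, 3))
```

Trying other values of c in the same way shows the trade-off described in the example: lowering c raises α* and lowers β*.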


Example 9.24 Let X1, ..., Xn be a random sample from a normal distribution with mean μ and
variance 1; the argument to be presented will work for any other known value of σ. Consider testing
H0: μ = μ0 versus Ha: μ = μa where μa > μ0. The likelihood ratio in (9.6) is

$$
\frac{(1/2\pi)^{n/2} e^{-(1/2)\sum (x_i - \mu_a)^2}}{(1/2\pi)^{n/2} e^{-(1/2)\sum (x_i - \mu_0)^2}}
= e^{(\mu_a - \mu_0)\sum x_i - (n/2)(\mu_a^2 - \mu_0^2)}
= \left[ e^{-n(\mu_a^2 - \mu_0^2)/2} \right] \cdot \left[ e^{(\mu_a - \mu_0)\sum x_i} \right]
$$

The term in the first set of brackets is a numerical constant. Then μa − μ0 > 0 implies that the
likelihood ratio will be at least k if and only if Σxi ≥ k′ for some k′, that is, if and only if x̄ ≥ k″
for some k″, which means if and only if

$$
z = \frac{\bar{x} - \mu_0}{1/\sqrt{n}} \ge c
$$

for some c. When H0 is true, the rv Z has a standard normal distribution (because σ = 1; again, this argument
works for any σ). If we now let c = z.01 = 2.33, then α* = P(Z ≥ c) = .01. By the Neyman-Pearson
Lemma, our old friend the one-sample z test has minimum β among all tests for which α ≤ .01. ■


Proof of the Neyman-Pearson Lemma We shall consider the case in which the Xi’s have a
discrete distribution, so that type I and type II error probabilities are obtained by summation. In the
continuous case, integration replaces summation. Let R denote the rejection region of any test
procedure based on the Xi’s, so that

$$
\alpha = P((X_1, \ldots, X_n) \in R \text{ when } \theta = \theta_0) = \sum_R f(x_1, \ldots, x_n; \theta_0)
$$

$$
\beta = P((X_1, \ldots, X_n) \in R' \text{ when } \theta = \theta_a) = 1 - \sum_R f(x_1, \ldots, x_n; \theta_a)
$$

(β is the probability outside the rejection region R, and the complement rule has been applied). Next,
let k > 0 be any constant, and consider the linear combination kα + β:

$$
k\alpha + \beta = k \sum_R f(x_1, \ldots, x_n; \theta_0) + 1 - \sum_R f(x_1, \ldots, x_n; \theta_a)
= 1 + \sum_R \left[ k f(x_1, \ldots, x_n; \theta_0) - f(x_1, \ldots, x_n; \theta_a) \right]
$$

The expression in brackets can be positive or negative. Now comes the clever part: among all possible
test procedures, kα + β is minimized by choosing R to be exactly the set where the expression in
brackets is negative (or zero). That is, kα + β is minimized by using the rejection region

$$
\left\{ (x_1, \ldots, x_n) : k f(x_1, \ldots, x_n; \theta_0) - f(x_1, \ldots, x_n; \theta_a) \le 0 \right\}
= \left\{ (x_1, \ldots, x_n) : \frac{f(x_1, \ldots, x_n; \theta_a)}{f(x_1, \ldots, x_n; \theta_0)} \ge k \right\}
$$

which is precisely R* from (9.6).

With α* and β* defined as in the statement of the Neyman-Pearson Lemma, what we have
established is that using R* minimizes kα + β, i.e., that kα* + β* ≤ kα + β for all other choices of
rejection region. In particular, for all test procedures satisfying α ≤ α*, α − α* ≤ 0, and so for these
test procedures

$$
k\alpha^* + \beta^* \le k\alpha + \beta \;\Rightarrow\; \beta^* \le \beta + k(\alpha - \alpha^*) \le \beta + 0 = \beta
$$

Thus we have shown that β* ≤ β for all such procedures, as desired. □


An essentially identical argument shows that the same rejection region (9.6) can be used to
minimize the chance of a type I error, subject to a constraint on the type II error probability. That is,
with the same notation as above, the chance of a type I error is ≥ α* for all test procedures for which
P(type II error) ≤ β*.


Power and Uniformly Most Powerful Tests 

The Neyman-Pearson Lemma identifies the most powerful test procedure when both hypotheses are
simple. Next consider the more realistic scenario where one or both of the hypotheses are composite.
In previous sections, the term power was primarily used when Ha was true (the chance of correctly
rejecting H0). The following definition generalizes this idea.


DEFINITION

Let Ω0 and Ωa be two disjoint sets of possible values of θ, and consider testing
H0: θ ∈ Ω0 versus Ha: θ ∈ Ωa using a test with rejection region R. Then the power
function of the test, denoted by π(·), is the probability of rejecting H0 considered as a
function of θ:

$$
\pi(\theta') = P((X_1, \ldots, X_n) \in R \text{ when } \theta = \theta')
$$

The power function is easily related to the type I and type II error probabilities:

$$
\pi(\theta') =
\begin{cases}
P(\text{type I error when } \theta = \theta') = \alpha(\theta') & \text{when } \theta' \in \Omega_0 \\
1 - P(\text{type II error when } \theta = \theta') = 1 - \beta(\theta') & \text{when } \theta' \in \Omega_a
\end{cases}
$$
Since we don’t want to reject the null hypothesis when θ ∈ Ω0 and do want to reject it when θ ∈ Ωa,
we desire a test for which the power function is close to 0 whenever θ′ is in Ω0 and close to 1
whenever θ′ is in Ωa. The ideal power function, though not achievable in practice, is

$$
\pi(\theta') =
\begin{cases}
0 & \text{when } \theta' \in \Omega_0 \\
1 & \text{when } \theta' \in \Omega_a
\end{cases}
$$

Example 9.25 The drying time (min) of a particular paint on a test board under controlled conditions
is known to be normally distributed with μ = 75 and σ = 9. A new additive has been developed for
the purpose of improving drying time. Assume that drying time with the additive is still normally
distributed with the same standard deviation, and consider testing H0: μ ≥ 75 versus Ha: μ < 75
based on a sample of size n = 100. A test with significance level .10 rejects H0 if z ≤ −z.10 = −1.28,
where z = (x̄ − 75)/(9/√100) = (x̄ − 75)/.9. Manipulating the inequality in the rejection region to
isolate x̄ gives the equivalent rejection region x̄ ≤ 73.848.

If μ = μ′, then X̄ has a normal distribution with mean μ′ and standard deviation σ/√n = .9. Thus
the power function of the test is

$$
\pi(\mu') = P(\bar{X} \le 73.848 \text{ when } \mu = \mu') = \Phi\left( \frac{73.848 - \mu'}{.9} \right)
$$


The ideal power function for these hypotheses equals 0 for μ ≥ 75 (H₀ is true) and equals 1 for
μ < 75 (Hₐ is true). Figure 9.11 shows both the actual power function π(μ′) and the ideal function.
The maximum power for μ ≥ 75 (i.e., in Ω₀) occurs at μ = 75, on the boundary between Ω₀ and Ωₐ;
specifically, π(75) = .10 = α by design. Because the power function is continuous, there are values of
μ smaller than 75 for which the power is quite small (barely above .10). Even with a large sample
size, it is difficult to detect a very small departure from the null hypothesis. But as n increases, the
actual power function will approach the ideal.
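The power function of Example 9.25 is easy to evaluate numerically. The following is a minimal Python sketch using only the standard library; the function name `power` and the grid of μ′ values are illustrative choices, not part of the example.

```python
from statistics import NormalDist

def power(mu_prime, cutoff=73.848, sigma=9.0, n=100):
    """P(sample mean <= cutoff) when the true mean is mu_prime."""
    se = sigma / n ** 0.5  # standard deviation of the sample mean (.9 here)
    return NormalDist().cdf((cutoff - mu_prime) / se)

# Illustrative grid: values at or above 75 lie in the null region,
# the remaining values lie in the alternative region.
for mu in (76, 75, 74.9, 74, 73, 72):
    print(f"mu' = {mu:>4}: power = {power(mu):.4f}")
```

The value at μ′ = 75 is about .10, and the power remains close to .10 for μ′ just below 75, which is the behavior described in the example.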


Figure 9.11 Graphs of power functions for Example 9.25 (vertical axis: power; dashed curve: ideal; solid curve: actual) ■

The Neyman–Pearson lemma says that when Ω₀ consists of a single value θ₀ and Ωₐ also consists
of a single value θₐ, the rejection region R* in (9.6) specifies a test for which the power π(θₐ) at the
alternative value θₐ is maximized subject to π(θ₀) ≤ α for some specified value of α. That is, R*
specifies a most powerful test subject to the restriction on the power when the null hypothesis is true.
What about best tests when at least one of the two hypotheses is composite?


Example 9.26 (Example 9.23 continued) Consider again a random sample of size n = 5 from a
Poisson distribution, and suppose we now wish to test H₀: μ ≤ 1 versus Hₐ: μ > 1. Both of these
hypotheses are composite. Arguing as in Example 9.23, for any value μₐ exceeding 1, the most
powerful test of H₀: μ = 1 versus Hₐ: μ = μₐ with significance level equal to .032 (i.e., π(1) = .032)
rejects the null hypothesis when Σxᵢ ≥ 10. Furthermore, it is easily verified that π(μ′) < .032
for μ′ < 1.

Thus the test that rejects H₀: μ ≤ 1 in favor of Hₐ: μ > 1 when Σxᵢ ≥ 10 has maximum power
for any μ′ = μₐ > 1, subject to the condition that π(μ′) ≤ π(1) = .032 whenever μ′ ≤ 1. This test is
uniformly most powerful. ■
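The two numerical claims in Example 9.26 (that π(1) = .032 and that the power is smaller for μ′ < 1) can be checked directly, because the sum of n independent Poisson(μ) observations is Poisson(nμ). The sketch below is a minimal Python illustration; the function names and the grid of μ′ values are assumptions made for the illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

def power(mu, n=5, cutoff=10):
    """P(sum of n observations >= cutoff) when each observation is Poisson(mu)."""
    lam = n * mu  # the sum of n iid Poisson(mu) variables is Poisson(n * mu)
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(cutoff))

for mu in (0.5, 0.8, 1.0, 1.5, 2.0):  # illustrative values, not from the text
    print(f"mu' = {mu}: power = {power(mu):.4f}")
```

The printed values increase with μ′, so the largest rejection probability over the null region μ′ ≤ 1 occurs at the boundary value μ′ = 1.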


More generally, a uniformly most powerful (UMP) level α test is one for which π(θ′) is maximized
for every θ′ ∈ Ωₐ subject to π(θ′) ≤ α for θ′ ∈ Ω₀. Unfortunately UMP tests are fairly rare,
especially in commonly encountered situations when H₀ and Hₐ are assertions about a single
parameter θ while the distribution of the Xᵢ’s involves at least one other “nuisance parameter.” For
example, when the population distribution is normal with values of both μ and σ unknown, σ is a
nuisance parameter when testing H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀. Be careful here: the null hypothesis
is not simple, because Ω₀ consists of all pairs (μ, σ) for which μ = μ₀ and σ > 0, and there is certainly
more than one such pair. In this situation, the one-sample t test is not UMP.

However, suppose we restrict attention to unbiased tests, those for which the smallest value of
π(θ′) for θ′ ∈ Ωₐ is at least as large as the largest value of π(θ′) for θ′ ∈ Ω₀. Unbiasedness simply says
that we are at least as likely to reject the null hypothesis when H₀ is false as we are to reject it when
H₀ is true. The test proposed in Example 9.25 involving paint drying times is unbiased because, as
Figure 9.11 shows, the power function at or to the right of 75 is smaller than it is to the left of 75. It
can be shown that the one-sample t test is UMP unbiased; that is, it is uniformly most powerful
among all tests that are unbiased. Several other commonly used tests also have this property. Please
consult the references by Casella and Berger or DeGroot and Schervish for more details on UMP
tests.


Likelihood Ratio Tests 
The likelihood ratio principle, described below, is a frequently used method for finding an appropriate
test statistic in a new situation. As before, denote the joint pmf or pdf of X₁, ..., Xₙ by
f(x₁, ..., xₙ; θ). In the case of a random sample, it will be a product f(x₁; θ) · ... · f(xₙ; θ). As in the
development of maximum likelihood estimates, when f(x₁, ..., xₙ; θ) is regarded as a function of θ, it
is called the likelihood function and is sometimes denoted L(θ).

Again consider testing H₀: θ ∈ Ω₀ versus Hₐ: θ ∈ Ωₐ, where Ω₀ and Ωₐ are disjoint sets, and let
Ω = Ω₀ ∪ Ωₐ. The set Ω is called the parameter space, since it represents all possible values of the
parameter θ under consideration. In the Neyman–Pearson Lemma, the test statistic is the ratio of the
likelihood when θ ∈ Ωₐ = {θₐ} to the likelihood when θ ∈ Ω₀ = {θ₀}, rejecting H₀ when the value of
the ratio is “sufficiently large.” For one or more composite hypotheses, we instead consider the ratio
of the likelihood when θ ∈ Ω₀ to the likelihood when θ ∈ Ω; the latter effectively puts no constraints
on the value of θ. A very small value of this ratio argues against the null hypothesis, since a small
value arises when the data is much more consistent with Hₐ than with H₀. More formally,


1. Find the largest value of L(θ) for θ ∈ Ω by finding the maximum likelihood estimate of θ; denote
this estimate by θ̂_mle. Substitute this mle into the likelihood function to obtain L(θ̂_mle).

2. Find the largest value of L(θ) for θ ∈ Ω₀ by finding the maximum likelihood estimate of θ within
Ω₀; denote this estimate by θ̂₀. Substitute this restricted mle into the likelihood function to obtain
L(θ̂₀).

Because Ω₀ is a subset of Ω, this restricted likelihood L(θ̂₀) can’t be any larger than the likelihood
L(θ̂_mle) obtained in the first step, and will be much smaller when the data is much more consistent
with Hₐ than with H₀.

3. Form the likelihood ratio test statistic

$$\Lambda = \frac{L(\hat\theta_0)}{L(\hat\theta_{\text{mle}})} = \frac{f(x_1,\ldots,x_n;\hat\theta_0)}{f(x_1,\ldots,x_n;\hat\theta_{\text{mle}})}$$

and reject the null hypothesis in favor of the alternative when this ratio is ≤ k. The critical value
k is chosen to give a test with the desired significance level. In practice, the inequality Λ ≤ k is
often re-expressed in terms of a more convenient statistic (such as the sum or mean of the
observations) whose distribution is known or can be derived.

The above prescription, called a likelihood ratio test, remains valid if the single parameter θ is
replaced by several parameters θ₁, ..., θₘ. The mles of all parameters must be obtained in both steps 1
and 2 and substituted back into the likelihood function.


Example 9.27 Consider a random sample from a normal distribution with the values of both
parameters unknown. We wish to test H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀. Here Ω consists of all values of
μ and σ for which −∞ < μ < ∞ and σ > 0, and the likelihood function is

$$L(\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2}$$

In Section 7.2 we obtained the mles as μ̂_mle = x̄, σ̂²_mle = Σ(xᵢ − x̄)²/n. Substituting these estimates
back into the likelihood function gives

$$L(\hat\mu_{\text{mle}},\hat\sigma^2_{\text{mle}}) = \cdots = \left(\frac{1}{2\pi\sum (x_i-\bar{x})^2/n}\right)^{n/2} e^{-n/2}$$

Within Ω₀, μ in the foregoing likelihood is replaced by μ₀, so that only σ must be estimated. More
precisely, the mle of μ subject to the constraint μ = μ₀ is trivially μ̂₀ = μ₀. It is easily verified that the
other mle under Ω₀ is σ̂₀² = Σ(xᵢ − μ₀)²/n. Substitution of this estimate in the likelihood function
yields

$$L(\hat\mu_0,\hat\sigma_0^2) = \cdots = \left(\frac{1}{2\pi\sum (x_i-\mu_0)^2/n}\right)^{n/2} e^{-n/2}$$

Thus we reject H₀ in favor of Hₐ when

$$\Lambda = \frac{L(\hat\mu_0,\hat\sigma_0^2)}{L(\hat\mu_{\text{mle}},\hat\sigma^2_{\text{mle}})} = \left(\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\mu_0)^2}\right)^{n/2} \le k$$

Raising both sides of this inequality to the power 2/n, we reject H₀ whenever

$$\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\mu_0)^2} \le k^{2/n} = k'$$

This is intuitively reasonable: the value μ₀ is implausible for μ if the sum of squared deviations about
the sample mean is much smaller than the sum of squared deviations about μ₀.
The denominator of this latter ratio can be expressed as

$$\sum \left[(x_i-\bar{x}) + (\bar{x}-\mu_0)\right]^2 = \sum (x_i-\bar{x})^2 + 2(\bar{x}-\mu_0)\sum (x_i-\bar{x}) + n(\bar{x}-\mu_0)^2$$

The middle (i.e., cross-product) term in this expression is 0, because the constant x̄ − μ₀ can be
moved outside the summation, and then the sum of deviations from the sample mean is 0. Thus we
should reject H₀ when

$$\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\bar{x})^2 + n(\bar{x}-\mu_0)^2} = \frac{1}{1 + n(\bar{x}-\mu_0)^2 / \sum (x_i-\bar{x})^2} \le k'$$

This latter ratio will be small when the second term in the denominator is large, so the condition for
rejection becomes

$$\frac{n(\bar{x}-\mu_0)^2}{\sum (x_i-\bar{x})^2} \ge k''$$


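Multiplying both sides by n − 1 (so that the denominator becomes s² = Σ(xᵢ − x̄)²/(n − 1)) and taking
square roots gives the rejection region

$$\text{either}\quad \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \ge c \quad\text{or}\quad \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \le -c$$

If we now let c = t_{α/2,n−1}, we have exactly the two-tailed one-sample t test!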

The bottom line is that when testing H₀: μ = μ₀ against the two-sided (≠) alternative, the one-sample
t test is the likelihood ratio test. This is also true of the upper-tailed version of the t test when
the alternative is Hₐ: μ > μ₀ and of the lower-tailed test when the alternative is Hₐ: μ < μ₀. We could
trace back through the argument to recover the critical constant k from c, but there is no point in doing
this; the rejection region in terms of t is much more convenient than the rejection region in terms of
the original likelihood ratio. ■
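The equivalence in Example 9.27 can be checked numerically. Combining the ratio of sums of squares with s² = Σ(xᵢ − x̄)²/(n − 1) gives Λ = [1 + t²/(n − 1)]^(−n/2), so Λ is small exactly when |t| is large. The Python sketch below computes Λ from the two maximized likelihoods and compares it with that expression; the data values, the null value, and all function names are hypothetical and serve only as an illustration.

```python
from math import exp, log, pi, sqrt

def max_log_likelihood(data, mu):
    """Normal log-likelihood with the mean fixed at mu and sigma^2 set to its mle."""
    n = len(data)
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * log(2 * pi * sigma2_hat) - 0.5 * n

def likelihood_ratio(data, mu0):
    """Restricted maximum of the likelihood divided by the unrestricted maximum."""
    xbar = sum(data) / len(data)
    return exp(max_log_likelihood(data, mu0) - max_log_likelihood(data, xbar))

def t_statistic(data, mu0):
    """One-sample t statistic (xbar - mu0) / (s / sqrt(n))."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
    return (xbar - mu0) / sqrt(s2 / n)

data = [9.8, 10.4, 10.1, 9.5, 10.9, 10.3]  # hypothetical measurements
mu0 = 10.0                                   # hypothetical null value
n = len(data)
lam = likelihood_ratio(data, mu0)
t = t_statistic(data, mu0)
print(lam, (1 + t ** 2 / (n - 1)) ** (-n / 2))  # the two values should agree
```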


A number of tests discussed subsequently in this book, including the “pooled” t test from the next
chapter and various tests from ANOVA (the analysis of variance) and regression analysis, can be
derived by the likelihood ratio principle.

In many situations, the inequality for the rejection region of a likelihood ratio test cannot be
manipulated to express the test procedure in terms of a simple statistic whose distribution can
be ascertained. The following large-sample result, valid under fairly general conditions, can then be
used.

THEOREM If the sample size n is sufficiently large, then the statistic −2 ln(Λ) has approximately a
chi-squared distribution with ν degrees of freedom when H₀ is true, where ν is the
difference between the number of “freely varying” parameters in Ω and the number of
such parameters in Ω₀.

For example, if the distribution sampled is bivariate normal with the 5 parameters μ₁, μ₂, σ₁, σ₂, and
ρ and the null hypothesis asserts that μ₁ = μ₂ and σ₁ = σ₂, then ν = 5 − 3 = 2.

By its definition 0 ≤ Λ ≤ 1, and the likelihood ratio test rejects H₀ when this likelihood ratio is
much less than 1. This is equivalent to rejecting H₀ when −2 ln(Λ) is large and positive. The large-sample
version of the test described in the theorem is thus upper-tailed: H₀ should be rejected if
−2 ln(Λ) ≥ χ²_{α,ν}, an upper-tail critical value extracted from Table A.5.


Example 9.28 Suppose a scientist makes n measurements of some physical characteristic, such as
the specific gravity of a liquid. Let X₁, ..., Xₙ denote the resulting measurement errors. Assume that
these Xᵢ’s are independent and identically distributed according to the double exponential (Laplace)
distribution: f(x) = .5e^(−|x−θ|) for −∞ < x < ∞. This pdf is symmetric about θ with somewhat heavier
tails than the normal pdf. If θ = 0 then the measurements are unbiased, so it is natural to test
H₀: θ = 0 versus Hₐ: θ ≠ 0. Here ν = 1 − 0 = 1. The likelihood is

$$L(\theta) = (.5)^n e^{-\sum |x_i-\theta|}$$

Because of the minus sign preceding the summation, the likelihood is maximized when Σ|xᵢ − θ| is
minimized. The absolute value function is not differentiable, and therefore differential calculus cannot
be used. Instead, consider for a moment the case n = 5 and let y₁ ≤ ··· ≤ y₅ denote the ordered values
of the xᵢ’s. For example, suppose a random sample of size 5 from the Laplace distribution with θ = 0
is −.24998, .75446, −.19053, 1.16237, .83229, so (y₁, ..., y₅) = (−.24998, −.19053, .75446, .83229,
1.16237). Then

$$\sum |x_i-\theta| = \sum |y_i-\theta| = \begin{cases}
y_1+y_2+y_3+y_4+y_5-5\theta & \theta < y_1 \\
-y_1+y_2+y_3+y_4+y_5-3\theta & y_1 \le \theta < y_2 \\
-y_1-y_2+y_3+y_4+y_5-\theta & y_2 \le \theta < y_3 \\
-y_1-y_2-y_3+y_4+y_5+\theta & y_3 \le \theta < y_4 \\
-y_1-y_2-y_3-y_4+y_5+3\theta & y_4 \le \theta < y_5 \\
-y_1-y_2-y_3-y_4-y_5+5\theta & \theta \ge y_5
\end{cases}$$


The graph of this expression as a function of θ appears in Figure 9.12, from which it is
apparent that the minimum occurs at y₃ = x̃ = .75446, the sample median. (The situation is similar
whenever n is odd. When n is even, the function achieves its minimum for any θ between y_{n/2} and
y_{(n/2)+1}; one such θ is (y_{n/2} + y_{(n/2)+1})/2 = x̃. In summary, the mle of θ is the sample median.)

Figure 9.12 Determining the mle of the double exponential parameter by minimizing Σ|xᵢ − θ|

The likelihood ratio statistic for testing the relevant hypotheses is Λ = (.5)ⁿe^(−Σ|xᵢ|)/[(.5)ⁿe^(−Σ|xᵢ − x̃|)].
Simplifying and computing −2 ln(Λ) gives the rejection region 2Σ|xᵢ| − 2Σ|xᵢ − x̃| ≥ χ²_{α,1} for the
large-sample version of the likelihood ratio test.

Suppose that a sample of n = 30 errors results in Σ|xᵢ| = 38.6 and Σ|xᵢ − x̃| = 37.3. Then

$$-2\ln(\Lambda) = 2\left(\sum |x_i| - \sum |x_i-\tilde{x}|\right) = 2(38.6 - 37.3) = 2.6$$

Comparing this to χ²_{.05,1} = 3.84, we would not reject the null hypothesis at the 5% significance level.
It is plausible that the measurement process indeed has mean/median 0, as desired. ■
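Both parts of Example 9.28 lend themselves to a quick numerical check: that the sample median minimizes Σ|xᵢ − θ| for the five listed values, and that the summary figures give −2 ln(Λ) = 2.6. The following Python sketch uses only numbers quoted in the example; the grid spacing and the function names are assumptions chosen for illustration.

```python
from statistics import median

def sum_abs_dev(theta, data):
    """Sum of |x_i - theta|, the quantity the Laplace mle minimizes."""
    return sum(abs(x - theta) for x in data)

sample = [-.24998, .75446, -.19053, 1.16237, .83229]  # the n = 5 sample from the example
theta_hat = median(sample)

# Brute-force check on an illustrative grid with step .001.
grid = [i / 1000 for i in range(-500, 1501)]
best_on_grid = min(grid, key=lambda theta: sum_abs_dev(theta, sample))
print(theta_hat, best_on_grid)

def neg2_log_lambda(sum_abs_null, sum_abs_mle):
    """-2 ln(Lambda) for H0: theta = 0, from the two sums of absolute deviations."""
    return 2 * (sum_abs_null - sum_abs_mle)

stat = neg2_log_lambda(38.6, 37.3)  # summary values for the n = 30 sample
print(stat, stat >= 3.84)           # 3.84 is the chi-squared critical value used in the example
```

The grid minimizer lands within one grid step of the sample median, and the comparison with 3.84 reproduces the decision not to reject H₀.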


Exercises: Section 9.5 (64–74)

64. For a random sample of n individuals taking a licensing exam, let Xᵢ = 1 if the ith
individual in the sample passes the exam and Xᵢ = 0 otherwise (i = 1, ..., n).

a. With p denoting the proportion of all exam-takers who pass, show that the
most powerful test of H₀: p = .5 versus Hₐ: p = .75 rejects H₀ when Σxᵢ ≥ c.

b. If n = 20 and you want α ≤ .05 for the test of (a), would you reject H₀ if 15 of
the 20 individuals in the sample pass the exam?

c. What is the power of the test you used in (b) when p = .75 [i.e., what is π(.75)]?

d. Is the test derived in (a) UMP for testing the hypotheses H₀: p = .5 versus
Hₐ: p > .5? Explain your reasoning.

e. Graph the power function π(p) of the test for the hypotheses of (d) when
n = 20 and α ≤ .05.

f. Return to the scenario of (a), and suppose the test is based on a sample size of
50. If the probability of a type II error is approximately .025, what is the
approximate significance level of the test (use a normal approximation)?

65. The error X in a measurement has a normal distribution with mean value 0 and variance
σ². Consider testing H₀: σ² = 2 versus Hₐ: σ² = 3 based on a random sample
X₁, ..., Xₙ of errors.

a. Show that a most powerful test rejects H₀ when Σxᵢ² ≥ c.

b. For n = 10, find the value of c for the test in (a) that results in α = .05.

c. Is the test of (a) UMP for H₀: σ² = 2 versus Hₐ: σ² > 2? Justify your assertion.

66. Suppose that X, the fraction of a container that is filled, has pdf f(x; θ) = θx^(θ−1) for
0 < x < 1 (where θ > 0), and let X₁, ..., Xₙ be a random sample from this distribution.

a. Show that the most powerful test for H₀: θ = 1 versus Hₐ: θ = 2 rejects the
null hypothesis if Σln(xᵢ) ≥ c.

b. Is the test of (a) UMP for testing H₀: θ = 1 versus Hₐ: θ > 1? Explain
your reasoning.

c. If n = 50, what is the (approximate) value of c for which the test has
significance level .05?

67. Consider a random sample of n component lifetimes, where the distribution of lifetime
is exponential with parameter λ.

a. Obtain a most powerful test for H₀: λ = 1 versus Hₐ: λ = .5, and
express the rejection region in terms of a “simple” statistic.

b. Is the test of (a) uniformly most powerful for H₀: λ = 1 versus Hₐ: λ < 1?
Justify your answer.

68. Consider a random sample of size n from the “shifted exponential” distribution with
pdf f(x; θ) = e^(−(x−θ)) for x ≥ θ and 0 otherwise (the graph is that of the ordinary
exponential pdf with λ = 1 shifted so that it begins its descent at θ rather than at 0). Let
Y₁ denote the smallest order statistic, and show that the likelihood ratio test of
H₀: θ ≤ 1 versus Hₐ: θ > 1 rejects the null hypothesis if y₁, the observed value of Y₁,
is ≥ c.

69. Suppose that each of n randomly selected individuals is classified according to his/her
genotype with respect to a particular genetic characteristic and that the three
possible genotypes are AA, Aa, and aa with long-run proportions (probabilities) θ²,
2θ(1 − θ), and (1 − θ)², respectively (0 < θ < 1). It is then straightforward to
show that the likelihood is

$$\theta^{2x_1} \cdot [2\theta(1-\theta)]^{x_2} \cdot (1-\theta)^{2x_3}$$

where x₁, x₂, and x₃ are the number of individuals in the sample who have the AA,
Aa, and aa genotypes, respectively. Show that the most powerful test for testing
H₀: θ = .5 versus Hₐ: θ = .8 rejects the null hypothesis when 2x₁ + x₂ ≥ c. Is this
test UMP for the alternative Hₐ: θ > .5? Explain. [Note: The fact that the joint
distribution of X₁, X₂, and X₃ is multinomial can be used to obtain the value of c that
yields a test with any desired significance level when n is large.]

70. The error in a measurement is normally distributed with mean μ and standard
deviation 1. Consider a random sample of n errors, and show that the likelihood ratio
test for H₀: μ = 0 versus Hₐ: μ ≠ 0 rejects the null hypothesis when either x̄ ≥ c or
x̄ ≤ −c. What is c for a test with α = .05? How does the test change if the standard
deviation of an error is σ₀ (known) and the relevant hypotheses are H₀: μ = μ₀ versus
Hₐ: μ ≠ μ₀?

71. Measurement error in a particular situation is normally distributed with mean value μ
and standard deviation 4. Consider testing H₀: μ = 0 versus Hₐ: μ ≠ 0 based on a
sample of n = 16 measurements.

a. Verify that the usual test with significance level .05 rejects H₀ if either
x̄ ≥ 1.96 or x̄ ≤ −1.96. [Note: That this test is unbiased follows from the fact
that the way to capture the largest area under the z curve above an interval
having width 3.92 is to center that interval at 0 (so it extends from −1.96 to 1.96).]

b. Consider the test that rejects H₀ if either x̄ ≥ 2.17 or x̄ ≤ −1.81. What is α, that
is, π(0)?

c. What is the power of the test proposed in (b) when μ = .1 and when μ = −.1?
(Note that .1 and −.1 are very close to the null value, so one would not expect
large power for such values.) Is the test unbiased?

d. Calculate the power of the usual test when μ = .1 and when μ = −.1. Is the
usual test a most powerful test? [Hint: Refer to your calculations in (c).] [Note:
It can be shown that the usual test is most powerful among all unbiased tests.]

72. A test of whether a coin is fair will be based on n = 50 tosses. Let X be the resulting
number of heads. Consider two rejection regions: R₁ = {x: either x ≤ 17 or x ≥ 33}
and R₂ = {x: either x ≤ 18 or x ≥ 37}.

a. Determine the significance level (type I error probability) for each rejection
region.

b. Determine the power of each test when p = .49. Is the test with rejection region
R₁ a uniformly most powerful level .033 test? Explain.

c. Is the test with rejection region R₂ unbiased? Explain.

d. Sketch the power function for the test with rejection region R₁, and then do so
for the test with the rejection region R₂. What does your intuition suggest about
the desirability of using the rejection region R₂?

73. Reconsider the one-sample t test of Example 9.27.

a. With t = (x̄ − μ₀)/(s/√n), show that the likelihood ratio is equal to
Λ = [1 + t²/(n − 1)]^(−n/2), and therefore the approximate chi-square statistic is
−2 ln(Λ) = n ln[1 + t²/(n − 1)].

b. Apply part (a) to test the hypotheses of Exercise 59, using the data given there.
Compare your results with the answers found in Exercise 59.

74. The test statistic in the Neyman–Pearson Lemma and the likelihood ratio test statistic
Λ are intimately related. Consider testing H₀: θ = θ₀ versus Hₐ: θ = θₐ, and let Λ*
denote the test statistic in (9.6). Show that

$$\Lambda = \begin{cases} 1/\Lambda^* & \text{if } L(\theta_0) < L(\theta_a) \\ 1 & \text{otherwise} \end{cases}$$


9.6 Further Aspects of Hypothesis Testing 


We close this chapter by briefly considering several additional aspects of hypothesis testing, including
the distinction between statistical significance (rejecting H₀ at a particular α) and the practical import
of a departure from H₀, the relationship between tests and confidence intervals or bounds, and test
procedures based on bootstrapping.


Statistical Versus Practical Significance 

Although the process of reaching a decision by using the methodology of classical hypothesis testing
involves selecting a level of significance and then rejecting or not rejecting H₀ at that level, simply
reporting the α used and the decision reached conveys little of the information contained in the sample
data. Especially when the results of an experiment are to be communicated to a large audience,
rejection of H₀ at level .05 will be much more convincing if the observed value of the test statistic
greatly exceeds the 5% critical value than if it barely exceeds that value. This is precisely what led to
the notion of P-value as a way of reporting significance without imposing a particular α on others who
might wish to draw their own conclusions. In fact, the editorial “Moving to a World Beyond
‘p < 0.05’” (The American Statistician 2019) calls for researchers to always report their actual
P-values, rather than just whether hypotheses were rejected at the .05 level, and some research
journals have begun adopting this policy.

Even if a P-value is included in a summary of results, however, there may be difficulty in
interpreting this value and in making a decision. This is in part because a small P-value, which would
ordinarily indicate statistical significance in that it would strongly suggest rejection of H₀ in favor of
Hₐ, may be the result of a large sample size in combination with a departure from H₀ that has little
practical significance. In many experimental situations, only departures from H₀ of large magnitude
would be worthy of detection, whereas a small departure from H₀ would have little practical
importance. The editorial cited above also recommends the abolishment of the phrase “statistically
significant” precisely because of this confusion.

Consider as an example testing H₀: μ = 100 versus Hₐ: μ > 100 where μ is the mean of a normal
population with σ = 10. Suppose a true value of μ = 101 would not represent a serious departure
from H₀, in the sense that not rejecting H₀ when μ = 101 would be a relatively inconsequential error;
this would be the case, for example, if μ represented the average IQ score within some population. For
a reasonably large sample size n this μ would lead to an x̄ value near 101, so we would not want this
sample evidence to argue strongly for rejection of H₀ when x̄ = 101 is observed. For various sample
sizes, Table 9.1 records both the P-value when x̄ = 101 and also the probability of not rejecting H₀ at
level .01 when μ = 101.


Table 9.1 An illustration of the effect of sample size on P-values and β

n          P-value when x̄ = 101     β(101) for level .01 test
25         .3085                     .9664
100        .1587                     .9082
400        .0228                     .6293
900        .0013                     .2514
1600       .0000335                  .0475
2500       .000000297                .0038
10,000     7.69 × 10⁻²⁴              .0000


The second column in Table 9.1 shows that even for moderately large sample sizes, the P-value of
x̄ = 101 argues very strongly for rejection of H₀ whereas the observed x̄ itself suggests that in
practical terms the true value of μ differs little from the null value μ₀ = 100. The third column points
out that even when there is little practical difference between the true μ (101) and the null value (100),
for a fixed α a large sample size will almost always lead to rejection of the null hypothesis at that
level. To summarize, one must be especially careful in interpreting evidence when the sample size is
very large, since any small departure from H₀ will almost surely be detected by a test, yet such a
departure may have little practical significance.
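The pattern in Table 9.1 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the critical value 2.33 is an assumed rounding of z.₀₁, the function names are illustrative, and `math.erfc` is used so that very small tail areas are not lost to rounding. Agreement with the extreme-tail entries of the printed table should be expected only approximately.

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z >= z) for a standard normal Z; erfc keeps precision far in the tail."""
    return 0.5 * erfc(z / sqrt(2))

def p_value(n, xbar=101.0, mu0=100.0, sigma=10.0):
    """Upper-tailed P-value for H0: mu = mu0 when the observed mean is xbar."""
    return upper_tail((xbar - mu0) / (sigma / sqrt(n)))

def beta(n, mu=101.0, mu0=100.0, sigma=10.0, z_crit=2.33):
    """P(not rejecting H0) when the true mean is mu; z_crit = 2.33 is an assumed rounding."""
    shift = (mu - mu0) / (sigma / sqrt(n))
    return upper_tail(shift - z_crit)  # equals Phi(z_crit - shift)

for n in (25, 100, 400, 900, 1600, 2500, 10000):
    print(f"n = {n:>6}: P-value = {p_value(n):.3g}, beta(101) = {beta(n):.4f}")
```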


The Relationship Between Confidence Intervals and Hypothesis Tests 
A confidence interval (Chapter 8) specifies a range of plausible values for an unknown population 
parameter. In contrast, the test procedures of this chapter focused on deciding whether a parameter 
equals a particular specified value. Not surprisingly, these two statistical inference methods are related 
and, in general, will yield consistent conclusions about a parameter when based on the same sample. 
Consider again a hypothesis test for a population mean μ of the form H₀: μ = 100 versus
Hₐ: μ ≠ 100. Rather than following the techniques of this chapter, what if we constructed a confidence
interval for μ instead? If 100 is within this confidence interval, then 100 is a plausible value of
μ; hence, we should not reject the claim that μ equals 100 (i.e., don’t reject H₀). Conversely, if 100
falls outside the confidence interval for μ, then 100 is not a plausible value for μ, and we should reject
the hypothesis μ = 100 in favor of the alternative μ ≠ 100. More generally, for H₀: μ = μ₀ versus
Hₐ: μ ≠ μ₀, we reject H₀ if and only if μ₀ falls outside a confidence interval for μ.

Two important mathematical connections need to be made here. First, in the preceding scenario,
both the confidence interval and the alternative hypothesis were “two-sided.” This is not a coincidence.
Suppose instead that we wanted to decide between the claims H₀: μ = μ₀ and Hₐ: μ < μ₀. We would
only decide in favor of Hₐ if the data provided convincing evidence that μ is lower than μ₀. This
suggests computing an upper confidence bound for μ: if we can say with confidence that μ is at most
some value B, and B is less than μ₀, then the data provides convincing evidence that μ is also less than
μ₀. On the other hand, if μ₀ is less than B, then the confidence statement “μ ≤ B” doesn’t tell us
whether μ is lower than μ₀ or not. Hence, we would not be comfortable rejecting H₀: μ = μ₀ in favor
of the alternative Hₐ: μ < μ₀. By the same reasoning, testing H₀: μ = μ₀ against the “upper-tailed”
alternative Hₐ: μ > μ₀ is equivalent to computing a lower confidence bound for μ and observing
whether μ₀ falls below that bound.

Second, any interval estimate carries with it an associated level of confidence (e.g., 95%), and
every hypothesis test is carried out at a specified level of significance (e.g., 5%). A hypothesis test at
significance level α is equivalent to the appropriate confidence interval/bound at confidence level
100(1 − α)%. This should seem intuitively reasonable, but a mathematical demonstration can also be
given (Exercise 80).


Example 9.29 Refer back to Example 9.12, in which data was provided on the d₅₀ value (a measure
of particulate matter size) for n = 9 roadside assays performed near Black Mountain, NC. Using the
summary statistics x̄ = 68.52 microns, s = 20.49, and the t critical value t_{.005,8} = 3.355, a 99% CI for
the true mean is

$$68.52 \pm 3.355 \cdot \frac{20.49}{\sqrt{9}} = 68.52 \pm 22.91 = (45.61,\ 91.43)$$

Because the interval does not include the value 44 microns, we can reject H₀: μ = 44 in favor of
Hₐ: μ ≠ 44 at the .01 level of significance. The significance level α = .01 of the hypothesis test aligns
with the selected confidence level: 99% = 100(1 − .01)%.

If the researchers were instead interested in testing H₀: μ = 44 versus Hₐ: μ > 44, then a lower
confidence bound for μ would be required, and H₀ would be rejected if that lower confidence bound
exceeded 44 microns (since it would then follow that μ also exceeds 44). ■
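The interval and the corresponding two-sided test decision in Example 9.29 can be reproduced from the summary statistics. The Python sketch below takes the critical value 3.355 from the example rather than computing it; the function name `t_interval` is an illustrative assumption.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Two-sided confidence interval xbar +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

lower, upper = t_interval(xbar=68.52, s=20.49, n=9, t_crit=3.355)
null_value = 44
reject_h0 = not (lower <= null_value <= upper)  # reject when the null value lies outside the CI
print(round(lower, 2), round(upper, 2), reject_h0)
```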


Some caution must be taken when applying this notion of “duality” between intervals and tests to a
population proportion p. This is because the standard deviation of P̂ is estimated differently for
confidence intervals and hypothesis tests: √(p̂q̂/n) for the former, √(p₀q₀/n) for the latter. Hence, it is
possible (though uncommon, especially for larger sample sizes) to get mutually contradictory conclusions
about a hypothesized value p₀ when comparing a hypothesis test to the corresponding
confidence interval.


General Large-Sample z Tests 

The large-sample tests for p presented in Section 9.3 are a special case of more general large-sample
procedures for a parameter θ. Let θ̂ be an estimator of θ that is at least approximately unbiased and
has approximately a normal distribution. (Recall that, under very general conditions, maximum
likelihood estimators have both of these properties.) The null hypothesis has the form H₀: θ = θ₀,
where θ₀ denotes a number (the null value) appropriate to the problem context. A large-sample test
statistic results from standardizing θ̂ under the assumption that H₀ is true [so that E(θ̂) = θ₀]:

$$\text{Test statistic: } Z = \frac{\hat\theta - \theta_0}{\sigma_{\hat\theta}}$$


If the alternative hypothesis is Hₐ: θ > θ₀, an upper-tailed test whose significance level is approximately
α is specified by the rejection region z ≥ z_α. The other two alternatives, Hₐ: θ < θ₀ and
Hₐ: θ ≠ θ₀, are tested using a lower-tailed z test and a two-tailed z test, respectively.

In some cases, when H₀ is true the standard deviation of θ̂, σ_θ̂, involves no unknown parameters.
For example, if θ = μ and θ̂ = X̄, σ_θ̂ = σ_X̄ = σ/√n, which involves no unknown parameters if the
value of σ is known. In the case θ = p, σ_θ̂ = σ_P̂ = √(p(1 − p)/n), which involves the parameter of
interest p itself. But σ_θ̂ does not involve any unknown parameters when H₀ is true, because we simply
substitute p = p₀ into the standard error. When σ_θ̂ does involve unknown parameters, it is often
possible to use an estimated standard deviation s_θ̂ in place of σ_θ̂ and still have Z approximately
normally distributed when H₀ is true (because when n is large, s_θ̂ ≈ σ_θ̂ for most samples). The one-sample
t test for large n furnishes an example of this: when σ is unknown, we use s_θ̂ = s_X̄ = s/√n in
place of σ/√n in the denominator of the test statistic (9.1), resulting in a t_{n−1}-distributed statistic.
When n is large, the t_{n−1} and z distributions are virtually indistinguishable, and so the use of z-based
rejection regions or P-values is not inappropriate in this situation.
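The general recipe (standardize the estimate under H₀, then refer to the standard normal distribution) can be sketched in Python as follows. The function name `z_test`, its `alternative` argument, and the numbers in the usage lines are all hypothetical and chosen only for illustration; the proportion example substitutes p₀ into the standard error as described above.

```python
from math import erfc, sqrt

def z_test(theta_hat, theta0, se, alternative="greater"):
    """Return (z, P-value) for a large-sample z test of H0: theta = theta0."""
    z = (theta_hat - theta0) / se
    upper = 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for standard normal Z
    if alternative == "greater":
        p = upper
    elif alternative == "less":
        p = 1.0 - upper
    elif alternative == "two-sided":
        p = 2.0 * min(upper, 1.0 - upper)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p

# Hypothetical illustration: sample proportion .56 from n = 200, null value p0 = .5.
p0, n = 0.5, 200
se_under_h0 = sqrt(p0 * (1 - p0) / n)  # standard error with p = p0 substituted
z, p = z_test(0.56, p0, se_under_h0, alternative="greater")
print(round(z, 3), round(p, 4))
```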


Bootstrap Hypothesis Testing for μ

The bootstrap technique was introduced in Chapter 8 as a way of producing interval estimates for
parameters without making additional assumptions about the population (e.g., normality). Analogous
methodology exists for testing hypotheses about an unknown parameter (here, μ) when the one-sample
t procedure described earlier is not applicable. Typically, this occurs when the sample size n is
not large and the sample data is heavily skewed or otherwise indicates that population normality is not
plausible.

The fundamental bootstrap concepts from Section 8.5 carry over to the hypothesis testing situation:
first, a sample of data x₁, ..., xₙ is obtained. To approximate the sampling distribution of a
statistic (here, X̄), many resamples of size n are randomly selected with replacement from x₁, ..., xₙ,
and the statistic of interest is calculated for each resample. The distribution of those resample means
x̄*₁, x̄*₂, ..., x̄*_B (the bootstrap distribution of X̄) provides a reasonable approximation to the sampling
distribution of X̄. Inferences about the population mean μ can then be made.

Hypothesis testing introduces one wrinkle: we need information about the distribution of X̄ when
the null hypothesis H₀: μ = μ₀ is true. The linchpin of the basic bootstrap method is to treat the
observed sample x₁, ..., xₙ as a population from which resamples will be drawn; however, this
“population” does not have mean μ₀. The mean of the original sample is of course the observed
sample mean x̄, not μ₀. To address this issue, the sample data must be adjusted as follows: create new
observations w₁, ..., wₙ by

$$w_i = x_i - \bar{x} + \mu_0, \qquad i = 1,\ldots,n$$


This action simply relocates the original sample data in order to have mean wp; plots of the x;’s and the 
w;s would be indistinguishable except for where they are centered. Now if we apply the basic 
bootstrap method to w1,...,Wn», the resulting resample means w], w3,...,W, provide a semblance of 
what the distribution of X would look like if Ho were true. 
From this bootstrap distribution of $\bar{w}^*$’s, a bootstrap P-value can be obtained by determining what proportion of bootstrap means are at least as contradictory to $H_0$ as the observed value of the test statistic, $\bar{x}$. For example, if the alternative hypothesis is $H_a$: $\mu < \mu_0$, then the bootstrap P-value is the proportion of values among $\bar{w}_1^*, \bar{w}_2^*, \ldots, \bar{w}_B^*$ that are less than or equal to $\bar{x}$, the sample mean of the original data.
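
The shift-and-resample procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the text's own software: the function name `bootstrap_mean_test`, its arguments, and the two-sided convention (counting resample means at least as far from $\mu_0$ as $\bar{x}$ is) are assumptions introduced here.

```python
import random


def bootstrap_mean_test(data, mu0, alternative="less", B=10_000, seed=None):
    """Bootstrap P-value for H0: mu = mu0, using the sample mean as the statistic.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    # Relocate the sample so it has mean mu0: w_i = x_i - xbar + mu0.
    w = [x - xbar + mu0 for x in data]
    # Means of resamples drawn with replacement from the shifted data
    # approximate the distribution of the sample mean when H0 is true.
    boot_means = [sum(rng.choices(w, k=n)) / n for _ in range(B)]
    if alternative == "less":
        extreme = sum(m <= xbar for m in boot_means)
    elif alternative == "greater":
        extreme = sum(m >= xbar for m in boot_means)
    elif alternative == "two-sided":
        # Assumed convention: at least as far from mu0 as the observed mean.
        extreme = sum(abs(m - mu0) >= abs(xbar - mu0) for m in boot_means)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return extreme / B
```

Because the P-value is a proportion over $B$ random resamples, repeated runs with different seeds will generally give slightly different values; fixing `seed` makes a run reproducible.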


Example 9.30 As 3D printing increases in popularity, the accuracy and precision of 3D scanners 
have become ever more critical. The article “3D Scanning Automation for Die Casting Quality 
Control” (Die Casting Engr., May, 2017: 16-18) describes a study in which a scanner was used on 
the same complicated object 12 times. For each run, the “flatness” (a sort of tolerance for surface 
smoothness) was recorded, resulting in the following measurements (microns): 


23.50 22.73 23.63 23.50 23.16 23.61 
23.54 22.64 23.55 23.41 23.49 23.18 


Does the data provide convincing statistical evidence that the true mean flatness under these settings exceeds 23 microns? Let’s test the hypotheses $H_0$: $\mu = 23$ versus $H_a$: $\mu > 23$. Figure 9.13 shows a normal probability plot of the data; its strongly nonlinear pattern indicates that the population distribution is very likely nonnormal. Since the sample size is small ($n = 12$), a one-sample $t$ test would not be appropriate.


Figure 9.13 Normal probability plot of the flatness data in Example 9.30 (vertical axis: Percent; horizontal axis: Flatness (microns))


Instead, we proceed with a bootstrap hypothesis test as described previously. The mean of the sample data is $\bar{x} = 23.3283$. An adjusted “population” $w_1, \ldots, w_{12}$ is created by subtracting $\bar{x}$ from each $x_i$ and adding $\mu_0 = 23$:

$$w_1 = x_1 - \bar{x} + \mu_0 = 23.50 - 23.3283 + 23 = 23.1717, \qquad w_2 = 22.73 - 23.3283 + 23 = 22.4017,$$

and so on. (A quick check confirms that the mean of the $w_i$’s is $\mu_0 = 23$, as it should be.) Then bootstrap resampling is performed on the $w_i$’s: take a sample of size 12 with replacement from $w_1, \ldots, w_{12}$, calculate the resample mean, and repeat. Figure 9.14 shows the result of $B = 10{,}000$ bootstrap resamples in R.


Figure 9.14 Bootstrap distribution for Example 9.30 (horizontal axis: Resample means)


The bootstrap distribution in Figure 9.14 is clearly skewed, validating the decision not to use a $t$ test. The histogram in Figure 9.14 shows how the statistic $\bar{X}$ would be expected to behave across repeated samples of size $n = 12$ from the population if the null hypothesis is true and the population mean is 23. Because this is an upper-tailed test, the bootstrap P-value is the proportion of these bootstrap values that are at least as large as the real sample mean, $\bar{x} = 23.3283$. As is evident from the histogram, this is an extremely low probability; in fact, zero of the 10,000 resampled means were as large as $\bar{x}$. Thus our bootstrap P-value is 0, indicating that we should reject $H_0$ at any significance level. The data makes it clear that the population mean flatness of 3D scans under these settings is greater than 23 microns. ■
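
The calculation in Example 9.30 can be reproduced with the short Python sketch below (the text's own resampling was done in R; the seed and variable names here are arbitrary choices, not from the source). One detail worth noticing: the largest shifted observation is $23.63 - 23.3283 + 23 \approx 23.30$, which is already below $\bar{x} = 23.3283$, so no resample mean can reach $\bar{x}$ and the bootstrap P-value is 0 whatever seed is used.

```python
import random

# Flatness measurements (microns) from Example 9.30.
flatness = [23.50, 22.73, 23.63, 23.50, 23.16, 23.61,
            23.54, 22.64, 23.55, 23.41, 23.49, 23.18]
mu0 = 23.0   # null value
B = 10_000   # number of bootstrap resamples

n = len(flatness)
xbar = sum(flatness) / n
# Shift the data so that the "population" being resampled has mean mu0.
shifted = [x - xbar + mu0 for x in flatness]

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
boot_means = [sum(rng.choices(shifted, k=n)) / n for _ in range(B)]

# Upper-tailed test: proportion of resample means at least as large as xbar.
p_value = sum(m >= xbar for m in boot_means) / B
print(round(xbar, 4), p_value)  # prints: 23.3283 0.0
```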


Exercises: Section 9.6 (75-82) 


75. Consider the large-sample level .01 test in Section 9.3 for testing $H_0$: $p = .2$ against $H_a$: $p > .2$.
a. For the alternative value $p = .21$, compute $\beta(.21)$ for sample sizes $n$ = 100, 2500, 10,000, 40,000, and 90,000.
b. For $\hat{p} = x/n = .21$, determine the P-value when $n$ = 100, 2500, 10,000, and 40,000.
c. In most situations, would it be reasonable to use a level .01 test in conjunction with a sample size of 40,000? Why or why not?

76. Reconsider the paint-drying problem discussed in Example 9.25. The hypotheses were $H_0$: $\mu = 75$ versus $H_a$: $\mu < 75$, with $\sigma$ assumed to have value 9. Consider the alternative value $\mu = 74$, which in the context of the problem would presumably not be a practically significant departure from $H_0$.
a. For a level .01 test, compute $\beta$ at this alternative for sample sizes $n$ = 100, 900, and 2500.
b. If the observed value of $\bar{X}$ is $\bar{x} = 74$, what can be said about the resulting P-value when $n = 2500$? Is the data statistically significant at any of the standard values of $\alpha$?
c. Would you really want to use a sample size of 2500 along with a level .01 test (disregarding the cost of such an experiment)? Explain.

77. When $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma)$ variables and $n$ is large, the sample variance $S^2$ has approximately a normal distribution with $E(S^2) = \sigma^2$ and $V(S^2) = 2\sigma^4/(n - 1)$.
a. Consider testing $H_0$: $\sigma = \sigma_0$. Use the mean and variance provided to construct a test statistic that has an approximately standard normal distribution when $H_0$ is true.
b. A manufacturer of exercise weights previously employed a process for which the standard deviation of the actual mass of its 10-lb weights was .1 lb. After improving the process, the manufacturers wished to test $H_0$: $\sigma = .1$ versus $H_a$: $\sigma < .1$, where $\sigma$ denotes the true standard deviation using the new process. A sample of 100 such weights has a sample standard deviation of .07 lb. Use this information and the test statistic in part (a) to determine whether $H_0$ should be rejected at the .05 level. [Note: Hypothesis testing for a population variance can also be based on the chi-squared distribution discussed in Section 8.4. See Exercises 98-99.]

78. When $X_1, X_2, \ldots, X_n$ are independent Poisson variables, each with parameter $\mu$, and $n$ is large, the sample mean $\bar{X}$ has approximately a normal distribution with $E(\bar{X}) = \mu$ and $V(\bar{X}) = \mu/n$. This implies that

$$Z = \frac{\bar{X} - \mu}{\sqrt{\mu/n}}$$

has approximately a standard normal distribution. For testing $H_0$: $\mu = \mu_0$, we can replace $\mu$ by $\mu_0$ in the equation for $Z$ to obtain a test statistic. This statistic is actually preferred to the one-sample $t$ statistic with denominator $S/\sqrt{n}$ when the $X_i$’s are Poisson because it is tailored explicitly to the Poisson assumption. If the number of requests for consulting received by a certain statistician during a 5-day work week has a Poisson distribution and the total number of consulting requests during a 36-week period is 160, does this suggest that the true average number of weekly requests exceeds 4.0? Test using $\alpha = .02$.

79. Consider the tip percentage data from Example 9.13.
a. Use the summary statistics $\bar{x} = 17.986$, $s = 5.937$, $n = 70$ and the $t$ critical value $t_{.05,69} = 1.667$ to construct a 95% lower confidence bound for the population mean tip percentage $\mu$.
b. Consider testing the hypotheses $H_0$: $\mu = 15$ versus $H_a$: $\mu > 15$. According to the bound in part (a), what is the rejection decision at the .05 level? Explain your reasoning.
c. Can the lower confidence bound in part (a) be used to test $H_0$: $\mu = 15$ versus $H_a$: $\mu \ne 15$ at the .05 level? Explain.
d. Return to the upper-tailed alternative $H_a$: $\mu > 15$. Does the lower confidence bound in part (a) prescribe a rejection decision at the .01 level? At the .10 level?

80. This exercise establishes the “duality” between confidence intervals/bounds and hypothesis tests for the one-sample $t$ procedures. (Similar derivations apply to other inference methods.)
a. Consider the lower-tailed $t$ test of $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu < \mu_0$. Show that the test statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ falls in the level $\alpha$ rejection region if and only if $\mu_0$ exceeds the one-sample $t$ upper confidence bound for $\mu$ with confidence level $100(1 - \alpha)\%$.
b. Next, consider the upper-tailed alternative $H_a$: $\mu > \mu_0$. Show that the test statistic falls in the level $\alpha$ rejection region if and only if $\mu_0$ is less than the lower $100(1 - \alpha)\%$ confidence bound for $\mu$.
c. Finally, show the equivalency between the (two-sided) confidence interval for $\mu$ and the two-tailed one-sample $t$ test of $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$.

81. Use the bootstrap hypothesis-testing method described in this section to test $H_0$: $\mu = 115$ versus $H_a$: $\mu < 115$ for the bagel data presented in Exercise 27.

82. Use the bootstrap hypothesis-testing method described in this section to test $H_0$: $\mu = 1.5$ versus $H_a$: $\mu \ne 1.5$ for the alcohol content data presented in Exercise 22.


Supplementary Exercises: (83-102) 


83. When a drug is recalled for safety concerns 
(e.g., too many people having serious 
adverse reactions), the pharmaceutical 
company making the drug can only re-issue 
it by convincing the FDA that the refor- 
mulated version of the drug is safer than the 
original version. 


a. In words, what are the null and alter- 
native hypotheses for this situation? 
[Hint: the FDA will not allow re- 
issuance unless they see convincing 
evidence of a safety improvement.] 

b. Describe the possible type I and type II 
errors in this scenario. 

c. Which of the two possible errors is worse, and why? On that basis, how should the FDA determine the $\alpha$ level for testing whether the reformulated drug is safer?
84. A sample of 50 lenses used in eyeglasses yields a sample mean thickness of 3.05 mm and a sample standard deviation of .34 mm. The desired true average thickness of such lenses is 3.20 mm. Does the data strongly suggest that the true average thickness of such lenses is something other than what is desired? Test using $\alpha = .05$.

85. In the previous exercise, suppose the experimenter had believed before collecting the data that the value of $\sigma$ was approximately .30. If the experimenter wished the probability of a type II error to be .05 when $\mu = 3.00$, was a sample size of 50 unnecessarily large?


86. It is specified that a certain type of iron should contain .85 g of silicon per 100 g of iron (.85%). The silicon content of each of 25 randomly selected iron specimens was determined, and the accompanying output resulted from a test of the appropriate hypotheses.

Variable    N     Mean      StDev     SE Mean    T       P
silcont     25    0.8880    0.1807    0.0361     1.05    0.30

a. What hypotheses were tested?
b. What conclusion would be reached for a significance level of .05, and why? Answer the same question for a significance level of .10.


87. A hot-tub manufacturer advertises that with 
its heating equipment, a temperature of 
100 °F can be achieved in at most 15 min. 
A random sample of 32 tubs is selected, 
and the time necessary to achieve a 100 °F 
temperature is determined for each tub. The 
sample average time and sample standard 
deviation are 17.5 min and 2.2 min, 
respectively. Does this data cast doubt on 
the company’s claim? Compute the P-value 
and use it to reach a conclusion at level .05 
(assume that the heating-time distribution is 
approximately normal). 

88. The true average breaking strength of ceramic insulators of a certain type is supposed to be at least 10 psi. They will be used for a particular application unless sample data indicates conclusively that this specification has not been met. A test of hypotheses using $\alpha = .01$ is to be based on a random sample of ten insulators. Assume that the breaking-strength distribution is normal with unknown standard deviation. [Note: Software is required for this exercise.]
a. If the true standard deviation is .80, how likely is it that insulators will be judged satisfactory when true average breaking strength is actually only 9.5? Only 9.0?
b. What sample size would be necessary to have a 75% chance of detecting that $H_0$ is false when true average breaking strength is 9.5 when the true standard deviation is .80?

89. The article “Caffeine Knowledge, Attitudes, and Consumption in Adult Women” (J. Nutrit. Ed. 1992: 179-184) reports the following summary data on daily caffeine consumption for a sample of adult women: $n = 47$, $\bar{x} = 215$ mg, $s = 235$ mg, and range = 5–1176.
a. Does it appear plausible that the population distribution of daily caffeine consumption is normal? Is it necessary to assume a normal population distribution to test hypotheses about the value of the population mean consumption? Explain your reasoning.
b. Suppose it had previously been believed that mean consumption was at most 200 mg. Does the given data contradict this prior belief? Test the appropriate hypotheses at significance level .10 and include a P-value in your analysis.

90. The incidence of a certain type of chromosome defect in the U.S. adult male population is believed to be 1 in 75. A random sample of 800 individuals in U.S. penal institutions reveals 16 who have such defects. Can it be concluded that the incidence rate of this defect among prisoners differs from the presumed rate for the entire adult male population?
a. State and test the relevant hypotheses using $\alpha = .05$. What type of error might you have made in reaching a conclusion?
b. What P-value is associated with this test? Based on this P-value, could $H_0$ be rejected at significance level .20?

91. In an investigation of the toxin produced by a certain poisonous snake, a researcher prepared 26 different vials, each containing 1 g of the toxin, and then determined the amount of antitoxin needed to neutralize the toxin. The sample average amount of antitoxin necessary was found to be 1.89 mg, and the sample standard deviation was .42. Previous research had indicated that the true average neutralizing amount was 1.75 mg/g of toxin. Does the new data contradict the value suggested by prior research? Test the relevant hypotheses using the P-value approach. Does the validity of your analysis depend on any assumptions about the population distribution of neutralizing amount? Explain.

92. The sample average unrestrained compressive strength for 45 specimens of a particular type of brick was computed to be 3107 psi, and the sample standard deviation was 188. The distribution of unrestrained compressive strength may be somewhat skewed. Does the data strongly indicate that the true average unrestrained compressive strength is less than the design value of 3200? Test using $\alpha = .001$.

93. To test the ability of auto mechanics to identify simple engine problems, an automobile with a single such problem was taken in turn to 72 different car repair facilities. Only 42 of the 72 mechanics who worked on the car correctly identified the problem. Does this strongly indicate that the true proportion of mechanics who could identify this problem is less than .75? Compute the P-value and reach a conclusion accordingly.

94. Chapter 8 presented a CI for the variance $\sigma^2$ of a normal population distribution. The key result there was that the rv $\chi^2 = (n - 1)S^2/\sigma^2$ has a chi-squared distribution with $n - 1$ df. Consider the null hypothesis $H_0$: $\sigma^2 = \sigma_0^2$ (equivalently, $\sigma = \sigma_0$). Then when $H_0$ is true, the test statistic $\chi^2 = (n - 1)S^2/\sigma_0^2$ has a chi-squared distribution with $n - 1$ df. If the relevant alternative is $H_a$: $\sigma^2 > \sigma_0^2$, rejecting $H_0$ if $\chi^2 \ge \chi^2_{\alpha, n-1}$ gives a test with significance level $\alpha$. To ensure reasonably uniform characteristics for a particular application, it is desired that the true standard deviation of the softening point of a certain type of petroleum pitch be at most .50 °C. The softening points of ten different specimens were determined, yielding a sample standard deviation of .58 °C. Does this strongly contradict the uniformity specification? Test the appropriate hypotheses using $\alpha = .01$.

95. Referring to the previous exercise, suppose an investigator wishes to test $H_0$: $\sigma^2 = .04$ versus $H_a$: $\sigma^2 < .04$ based on $n = 21$ observations. The computed value of $20s^2/.04$ is 8.58. Place bounds on the P-value and then reach a conclusion at level .01.

96. When the population distribution is normal and $n$ is large, the sample standard deviation $S$ has approximately a normal distribution with $E(S) \approx \sigma$ and $V(S) \approx \sigma^2/(2n)$. We already know that in this case, for any $n$, $\bar{X}$ is normal with $E(\bar{X}) = \mu$ and $V(\bar{X}) = \sigma^2/n$.
a. Assuming that the underlying distribution is normal, what is an approximately unbiased estimator of the 99th percentile $\theta = \mu + 2.33\sigma$?
b. As discussed in Section 6.4, when the $X_i$’s are normal $\bar{X}$ and $S$ are independent rvs (one measures location whereas the other measures spread). Use this to compute $V(\hat{\theta})$ and $\sigma_{\hat{\theta}}$ for the estimator $\hat{\theta}$ of part (a). What is the estimated standard error $\hat{\sigma}_{\hat{\theta}}$?
c. Write a test statistic for testing $H_0$: $\theta = \theta_0$ that has approximately a standard normal distribution when $H_0$ is true. If soil pH is normally distributed in a certain region and 64 soil samples yield $\bar{x} = 6.33$, $s = .16$, does this provide strong evidence for concluding that at most 99% of all possible samples would have a pH of less than 6.75? Test using $\alpha = .01$.

97. Let $X_1, X_2, \ldots, X_n$ be a random sample from an exponential distribution with parameter $\lambda$. Then it can be shown that $2\lambda \sum X_i$ has a chi-squared distribution with $\nu = 2n$ (by first showing that $2\lambda X_i$ has a chi-squared distribution with $\nu = 2$).
a. Use this fact to obtain a test statistic and rejection region that together specify a level $\alpha$ test for $H_0$: $\mu = \mu_0$ versus each of the three commonly encountered alternatives. [Hint: $E(X_i) = \mu = 1/\lambda$, so $\mu = \mu_0$ is equivalent to $\lambda = 1/\mu_0$.]
b. Suppose that ten identical components, each having exponentially distributed time until failure, are tested. The resulting failure times are

95 16 11 3 42 71 225 64 87 123

Use the test procedure of part (a) to decide whether the data strongly suggests that the true average lifetime is less than the previously claimed value of 75.

98. Suppose the population distribution is normal with known $\sigma$. Let $\gamma$ be such that $0 < \gamma < \alpha$. For testing $H_0$: $\mu = \mu_0$ versus $H_a$: $\mu \ne \mu_0$, consider the test that rejects $H_0$ if either $z \ge z_\gamma$ or $z \le -z_{\alpha - \gamma}$, where the test statistic is $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$.
a. Show that P(type I error) = $\alpha$.
b. Derive an expression for $\beta(\mu')$. [Hint: Express the test in the form “reject $H_0$ if either $\bar{x} \ge c_1$ or $\le c_2$.”]
c. Let $\Delta > 0$. For what values of $\gamma$ (relative to $\alpha$) will $\beta(\mu_0 + \Delta) < \beta(\mu_0 - \Delta)$?


99. After a period of apprenticeship, an organization gives an exam that must be passed to be eligible for membership. Let $p$ = P(randomly chosen apprentice passes). The organization wishes an exam that most but not all should be able to pass, so it decides that $p = .90$ is desirable. For a particular exam, the relevant hypotheses are $H_0$: $p = .90$ versus $H_a$: $p \ne .90$. Suppose ten people take the exam, and let $X$ = the number who pass.
a. Does the lower-tailed region {0, 1, ..., 5} specify a level .01 test?
b. Show that even though $H_a$ is two-sided, no two-tailed test is a level .01 test.
c. Sketch a graph of power as a function of $p'$ for this test. Is this desirable?

100. A service station has six gas pumps. When no vehicles are at the station, let $p_i$ denote the probability that the next vehicle will select pump $i$ ($i = 1, 2, \ldots, 6$). Based on a sample of size $n$, we wish to test $H_0$: $p_1 = \cdots = p_6$ versus the alternative $H_a$: $p_1 = p_3 = p_5$, $p_2 = p_4 = p_6$ (note that $H_a$ is not a simple hypothesis). Let $X$ be the number of customers in the sample that select an even-numbered pump.
a. Show that the likelihood ratio test rejects $H_0$ if either $X \ge c$ or $X \le n - c$. [Hint: When $H_a$ is true, let $\theta$ denote the common value of $p_2$, $p_4$, and $p_6$.]
b. Let $n = 10$ and $c = 9$. Determine the power of the test both when $H_0$ is true and also when $p_2 = p_4 = p_6 = 1/10$, $p_1 = p_3 = p_5 = 7/30$.

101. Consider testing a pair of simple hypotheses $H_0$: $\theta = \theta_0$ versus $H_a$: $\theta = \theta_a$. Rather than prescribing the significance level and minimizing P(type II error), imagine trying to minimize the linear combination $a \cdot \alpha + b \cdot \beta$ for some specified constants $a > 0$ and $b > 0$. Show that $a \cdot \alpha + b \cdot \beta$ is minimized by using the rejection region

$$R^* = \left\{ (x_1, \ldots, x_n) : \frac{f(x_1, \ldots, x_n; \theta_a)}{f(x_1, \ldots, x_n; \theta_0)} \ge \frac{a}{b} \right\}$$

[Hint: Imitate the first half of the proof of the Neyman-Pearson Lemma, but use $a \cdot \alpha + b \cdot \beta$ in place of $k\alpha + \beta$.]

102. Refer back to the scenario introduced in Example 9.23, where $H_0$: $\mu = 1$ versus $H_a$: $\mu = 2$ was tested based on a sample from a Poisson($\mu$) distribution. Suppose committing a type II error is considered 3 times as problematic as a type I error, and so the manufacturers wish to minimize $\alpha + 3\beta$.
a. Determine the test procedure that minimizes $\alpha + 3\beta$ when $n = 5$. [Hint: Refer back to the previous exercise.]
b. For the test procedure in part (a), what are $\alpha$, $\beta$, and the (minimized) value of $\alpha + 3\beta$?
c. Repeat parts (a)-(b) for $n = 10$.




Introduction 

Chapters 8 and 9 presented confidence intervals (CIs) and hypothesis-testing procedures for single parameters, such as a population mean $\mu$ and a population proportion $p$. In this chapter, we extend these methods to situations involving the means, proportions, and variances of two different population distributions. For example, let $\mu_1$ and $\mu_2$ denote the true average decrease in cholesterol for two drugs. Then an investigator might wish to use results from patients assigned at random to two different groups as a basis for testing the null hypothesis $\mu_1 = \mu_2$ versus the alternative hypothesis $\mu_1 \ne \mu_2$. As another example, let $p_1$ denote the true proportion of all metal-on-metal hip replacements that fail, and let $p_2$ represent the true proportion of all ceramic-on-ceramic replacements that fail. Based on surveys of 500 people with each type of hip replacement, we might like an interval estimate for the difference $p_1 - p_2$.


10.1 The Two-Sample z Confidence Interval and Test 


The inferences discussed in this section concern a difference $\mu_1 - \mu_2$ between the means of two different population distributions. An investigator might, for example, wish to test hypotheses about the difference between the true mean stopping distances of two different braking systems under identical conditions. One such hypothesis would state that $\mu_1 - \mu_2 = 0$, i.e., that $\mu_1 = \mu_2$. Alternatively, it may be appropriate to estimate $\mu_1 - \mu_2$ by computing a 95% CI. Such inferences would be based on a sample of stopping distances for each braking system.


ASSUMPTIONS
1. $X_1, X_2, \ldots, X_m$ is a random sample from a population with mean $\mu_1$ and standard deviation $\sigma_1$.
2. $Y_1, Y_2, \ldots, Y_n$ is a random sample from a population with mean $\mu_2$ and standard deviation $\sigma_2$.
3. The $X$ and $Y$ samples are independent of each other.


The natural estimator of $\mu_1 - \mu_2$ is $\bar{X} - \bar{Y}$, the difference between the corresponding sample means. The test statistic results from standardizing this estimator, so we need expressions for the expected value and standard deviation of $\bar{X} - \bar{Y}$.
PROPOSITION The expected value of $\bar{X} - \bar{Y}$ is $\mu_1 - \mu_2$, so $\bar{X} - \bar{Y}$ is an unbiased estimator of $\mu_1 - \mu_2$. The standard deviation of $\bar{X} - \bar{Y}$ is

$$\sigma_{\bar{X} - \bar{Y}} = \sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}$$

Proof Both these results depend on the rules of expected value and variance presented in Chapter 5. By linearity of expectation,

$$E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_1 - \mu_2$$

Because the $X$ and $Y$ samples are independent, $\bar{X}$ and $\bar{Y}$ are independent quantities, so the variance of the difference is the sum of $V(\bar{X})$ and $V(\bar{Y})$:

$$V(\bar{X} - \bar{Y}) = V(\bar{X}) + V(\bar{Y}) = \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}$$

The standard deviation of $\bar{X} - \bar{Y}$ is the square root of this expression. ■


Regarding $\mu_1 - \mu_2$ as a parameter $\theta$, its estimator is $\hat{\theta} = \bar{X} - \bar{Y}$ with standard deviation $\sigma_{\hat{\theta}}$ given by the proposition. When $\sigma_1$ and $\sigma_2$ both have known values, the test statistic will have the form $(\hat{\theta} - \text{null value})/\sigma_{\hat{\theta}}$; this form of a test statistic was used in several one-sample problems in the previous chapter. If $\sigma_1$ and $\sigma_2$ are unknown, the sample standard deviations must be used to estimate $\sigma_{\hat{\theta}}$ (the topic of Section 10.2).
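
The proposition can be checked numerically with a small Monte Carlo sketch. Everything specific below is an assumption made for illustration: normal populations, the parameter values, the number of replications, the seed, and the helper names `sd_diff_means` and `simulate_diffs`.

```python
import math
import random
from statistics import mean, stdev


def sd_diff_means(sigma1, m, sigma2, n):
    """Standard deviation of Xbar - Ybar for independent samples (the proposition)."""
    return math.sqrt(sigma1**2 / m + sigma2**2 / n)


def simulate_diffs(mu1, sigma1, m, mu2, sigma2, n, reps=20_000, seed=1):
    """Simulate Xbar - Ybar repeatedly from two independent normal samples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        xbar = sum(rng.gauss(mu1, sigma1) for _ in range(m)) / m
        ybar = sum(rng.gauss(mu2, sigma2) for _ in range(n)) / n
        diffs.append(xbar - ybar)
    return diffs


# Illustrative (made-up) settings: mu1 - mu2 = 2 and sqrt(4/10 + 9/15) = 1.
diffs = simulate_diffs(mu1=5, sigma1=2, m=10, mu2=3, sigma2=3, n=15)
print(mean(diffs), stdev(diffs), sd_diff_means(2, 10, 3, 15))
```

The simulated mean and standard deviation of the differences should land close to $\mu_1 - \mu_2 = 2$ and $\sigma_{\bar{X} - \bar{Y}} = 1$, up to simulation error.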


Confidence Interval for $\mu_1 - \mu_2$ With Known $\sigma$’s
In Chapters 8 and 9, the first CI and test procedure for a population mean $\mu$ were based on the assumption that the population distribution was normal with the value of the population standard deviation $\sigma$ known to the investigator. Similarly, we first assume here that both population distributions are normal and that the values of both $\sigma_1$ and $\sigma_2$ are known.

Because the population distributions are normal, both $\bar{X}$ and $\bar{Y}$ have normal distributions. This implies that $\bar{X} - \bar{Y}$ is normally distributed, with expected value $\mu_1 - \mu_2$ and standard deviation $\sigma_{\bar{X} - \bar{Y}}$ given in the foregoing proposition. Standardizing $\bar{X} - \bar{Y}$ gives the standard normal variable


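$$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}} \qquad (10.1)$$

Since the area under the $z$ curve between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$, it follows that

$$P\left( -z_{\alpha/2} < \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}} < z_{\alpha/2} \right) = 1 - \alpha$$

Manipulation of the inequalities inside the parentheses to isolate $\mu_1 - \mu_2$ yields the equivalent probability statement

$$P\left( \bar{X} - \bar{Y} - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}} < \mu_1 - \mu_2 < \bar{X} - \bar{Y} + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}} \right) = 1 - \alpha$$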


This implies that a $100(1 - \alpha)\%$ CI for $\mu_1 - \mu_2$ has lower limit $(\bar{x} - \bar{y}) - z_{\alpha/2} \cdot \sigma_{\bar{X} - \bar{Y}}$ and upper limit $(\bar{x} - \bar{y}) + z_{\alpha/2} \cdot \sigma_{\bar{X} - \bar{Y}}$. This interval is a special case of the general formula $\hat{\theta} \pm z_{\alpha/2} \cdot \sigma_{\hat{\theta}}$.

If both $m$ and $n$ are large, the CLT implies that both $\bar{X}$ and $\bar{Y}$ are approximately normal. In that case, this interval remains valid with a confidence level of approximately $100(1 - \alpha)\%$ irrespective of the population distributions.


TWO-SAMPLE z INTERVAL

Assuming independent random samples from normal population distributions, a CI for $\mu_1 - \mu_2$ with a confidence level of $100(1 - \alpha)\%$ has endpoints

$$\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}}$$

An upper or lower confidence bound can also be calculated by retaining the appropriate sign (+ or −) and replacing $z_{\alpha/2}$ by $z_\alpha$.

These confidence limits may also be applied to samples from nonnormal populations, provided that both sample sizes are large (say, $m > 40$ and $n > 40$); the confidence level is then approximate.


In practice, the assumption of known $\sigma$’s is generally unrealistic. If information were available concerning the population standard deviations, typically both $\mu_1$ and $\mu_2$ would also be known. In Section 10.2, we will examine the more realistic scenario when the values of all four parameters are unknown.


Example 10.1 The article “Reflective Tape Applied to Bicycle Frame and Conspicuity Enhance- 
ment at Night” (Hum. Factors 2017: 485-500) describes a series of studies to determine the distance 
at which drivers can see a bicyclist ahead in the road (“detection distance”) at night. One study 
compared detection distance when the bicycle had a typical red reflector mounted on the rear of the 
bike versus having reflective tape wrapped around the posterior forks, seat post, and rear reflector 
panel. The sample mean detection distances under these two conditions were $\bar{x} = 67.66$ m and $\bar{y} = 168.28$ m, respectively.

Suppose these observations were based on independent random samples of $m = n = 64$ drivers, and that the population standard deviations under these two conditions are $\sigma_1 = 30$ m and $\sigma_2 = 40$ m (values consistent with information in the article). Then a 95% CI for $\mu_1 - \mu_2$, the true difference in mean detection distance under these two settings, is

$$(67.66 - 168.28) \pm 1.96\sqrt{\frac{30^2}{64} + \frac{40^2}{64}} = -100.62 \pm 1.96(6.25) = (-112.87, -88.37)$$

Note that the confidence level is approximate, because with large sample sizes but no assumption of normality we are relying on the CLT. The interval indicates that average nighttime detection distance is between 88.37 and 112.87 m greater (that is, better) for bikes using reflective tape versus those just relying on the standard red rear reflector. ■
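
As a rough check on the arithmetic in Example 10.1, the two-sample z interval can be computed with Python's standard library. The function name `two_sample_z_interval` and its argument names are illustrative assumptions, not part of the text.

```python
import math
from statistics import NormalDist


def two_sample_z_interval(xbar, ybar, sigma1, sigma2, m, n, conf=0.95):
    """CI for mu1 - mu2 with known sigmas: (xbar - ybar) +/- z * sqrt(s1^2/m + s2^2/n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z critical value, about 1.96 for 95%
    se = math.sqrt(sigma1**2 / m + sigma2**2 / n)
    diff = xbar - ybar
    return diff - z * se, diff + z * se


# Values from Example 10.1.
lower, upper = two_sample_z_interval(67.66, 168.28, 30, 40, 64, 64)
print(round(lower, 2), round(upper, 2))  # prints: -112.87 -88.37
```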
If the standard deviations $\sigma_1$ and $\sigma_2$ are known and the investigator uses equal sample sizes, then the sample size $m = n$ for each sample that yields a $100(1 - \alpha)\%$ interval of width $w$ is

$$n = \frac{z_{\alpha/2}^2 (\sigma_1^2 + \sigma_2^2)}{(w/2)^2}$$

which will generally have to be rounded up to an integer. (Recall that $w/2$ represents the desired bound on the interval’s margin of error. The sample size formula results from setting $w/2 = z_{\alpha/2}\sqrt{\sigma_1^2/n + \sigma_2^2/n}$ and solving for $n$.)
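
The sample size formula is easy to evaluate directly. The sketch below is illustrative only: the helper name `sample_size_for_width` is an assumption, and the target width of 20 m is a made-up value, while $\sigma_1 = 30$ and $\sigma_2 = 40$ are borrowed from Example 10.1.

```python
import math
from statistics import NormalDist


def sample_size_for_width(sigma1, sigma2, width, conf=0.95):
    """Common sample size m = n giving a two-sample z interval of the stated width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n = z**2 * (sigma1**2 + sigma2**2) / (width / 2) ** 2
    return math.ceil(n)  # round up to the next integer


print(sample_size_for_width(30, 40, width=20))    # hypothetical target width of 20 m
print(sample_size_for_width(30, 40, width=24.5))  # width of the interval in Example 10.1
```

The second call uses the width of the interval actually obtained in Example 10.1 (2 × 12.25 = 24.5 m) and returns 64, matching the sample sizes assumed there.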


Test Procedures for $\mu_1 - \mu_2$ with Known $\sigma$’s

In a hypothesis-testing problem, the null hypothesis will state that $\mu_1 - \mu_2$ has a specified value. Denoting this null value by $\Delta_0$, the null hypothesis becomes $H_0$: $\mu_1 - \mu_2 = \Delta_0$. For example, the null hypothesis might state that the difference in true average fuel efficiencies (mpg) between cars having a turbocharged engine and a nonturbo engine is −3 (that is, on average turbocharging decreases fuel efficiency by 3 mpg). The null value would be $\Delta_0 = +3$ if the subscripts 1 and 2 instead referred to nonturbo and turbo engines, in that order. Often $\Delta_0 = 0$, in which case $H_0$ is equivalent to asserting that $\mu_1 = \mu_2$. A test statistic results from replacing $\mu_1 - \mu_2$ in Expression (10.1) by the null value $\Delta_0$. Because the test statistic $Z$ is obtained by standardizing $\bar{X} - \bar{Y}$ under the assumption that $H_0$ is true, it has a standard normal distribution in this case.

Consider the alternative hypothesis $H_a$: $\mu_1 - \mu_2 > \Delta_0$. A value $\bar{x} - \bar{y}$ that considerably exceeds $\Delta_0$ (the expected value of $\bar{X} - \bar{Y}$ when $H_0$ is true) provides evidence against $H_0$ and for $H_a$. Such a value of $\bar{x} - \bar{y}$ corresponds to a positive and large value of $z$. Thus $H_0$ should be rejected in favor of $H_a$ if $z$ is greater than or equal to an appropriately chosen critical value. Because the test statistic $Z$ has a standard normal distribution when $H_0$ is true, the upper-tailed rejection region $z \ge z_\alpha$ gives a test with significance level (type I error probability) $\alpha$. Rejection regions for the other two alternatives $H_a$: $\mu_1 - \mu_2 < \Delta_0$ and $H_a$: $\mu_1 - \mu_2 \ne \Delta_0$ that yield tests with desired significance level $\alpha$ are lower-tailed and two-tailed, respectively.

As in the confidence interval discussion, z-based inference is still approximately correct here for samples from nonnormal populations provided both $m$ and $n$ are large. (Note, though, that here we still assume known population standard deviations.)


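TWO-SAMPLE z TEST

Null hypothesis: $H_0$: $\mu_1 - \mu_2 = \Delta_0$

Test statistic value:

$$z = \frac{\bar{x} - \bar{y} - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}}$$

Alternative Hypothesis                  Rejection Region for Level $\alpha$ Test
$H_a$: $\mu_1 - \mu_2 > \Delta_0$       $z \ge z_\alpha$ (upper-tailed test)
$H_a$: $\mu_1 - \mu_2 < \Delta_0$       $z \le -z_\alpha$ (lower-tailed test)
$H_a$: $\mu_1 - \mu_2 \ne \Delta_0$     either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ (two-tailed test)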


Because these are z tests, a P-value is computed as it was for the z tests in Chapter 9: P-value = $1 - \Phi(z)$ for an upper-tailed test, $= \Phi(z)$ for a lower-tailed test, and $= 2[1 - \Phi(|z|)]$ for a two-tailed test.

These test procedures may also be applied to samples from nonnormal populations, provided that both sample sizes are large (say, $m > 40$ and $n > 40$).
Example 10.2 Each student in a class of 21 responded to a questionnaire that requested their grade point average (GPA) and the number of hours they studied each week. For those who studied less than 10 h/week the GPAs were

2.80 3.40 4.00 3.60 2.00 3.00 3.47 2.80 2.60 2.00

and for those who studied at least 10 h/week the GPAs were

3.00 3.00 2.20 2.40 4.00 2.96 3.41 3.27 3.80 3.10 2.50

Normal probability plots for both sets are reasonably linear, so the normality assumption is tenable. Because the standard deviation of GPAs for the whole campus is $\sigma = .6$, it is reasonable to apply that value here to both (conceptual) populations. The sample means are 2.97 for the <10 study hours group and 3.06 for the ≥10 study hours group. Treating the two samples as random, is there evidence that true average GPA is higher for students who study more? Let’s carry out a test of significance at level .05 using the seven-step procedure outlined in Section 9.2.


1. Parameter: $\mu_1 - \mu_2$, the difference between true mean GPA for the (conceptual) <10 population and true mean GPA for the ≥10 population
2. Hypotheses:
   $H_0\colon \mu_1 - \mu_2 = 0$ (i.e., $\mu_1 = \mu_2$)
   $H_a\colon \mu_1 - \mu_2 < 0$ (i.e., $\mu_1 < \mu_2$)
3. Assumptions/requirements: We have assumed underlying normal distributions for the GPAs of both populations, each with a known population standard deviation.
4. Test statistic value: With $\Delta_0 = 0$, the test statistic value is

   $z = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}}$

5. Rejection region: The inequality in $H_a$ implies that the test is lower-tailed. For $\alpha = .05$, $z_\alpha = z_{.05} = 1.645$. $H_0$ will be rejected if $z \le -1.645$.
6. Substituting $m = 10$, $\bar{x} = 2.97$, $\sigma_1 = .6$, $n = 11$, $\bar{y} = 3.06$, and $\sigma_2 = .6$ into the formula for z yields

   $z = \dfrac{2.97 - 3.06}{\sqrt{\dfrac{.6^2}{10} + \dfrac{.6^2}{11}}} = \dfrac{-.09}{.262} = -.34$

   That is, the value of $\bar{x} - \bar{y}$ is only one-third of a standard deviation below what would be expected when $H_0$ is true.
7. Because the value of z is not even close to the rejection region, there is no reason to reject the null hypothesis. This test does not provide convincing statistical evidence that students who study ≥10 h per week have a higher mean GPA than those studying <10 h per week. ■
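The calculation in Example 10.2 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `two_sample_z` and its `tail` argument are illustrative choices, not part of the text. It uses the rounded sample means 2.97 and 3.06, as the example does.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar, ybar, sigma1, sigma2, m, n, delta0=0.0, tail="two"):
    """Return (z, P-value) for the two-sample z test with known sigmas.

    tail is "upper", "lower", or "two" (hypothetical argument names).
    """
    se = sqrt(sigma1**2 / m + sigma2**2 / n)  # sd of Xbar - Ybar
    z = (xbar - ybar - delta0) / se
    phi = NormalDist().cdf                    # standard normal cdf
    if tail == "upper":
        p_value = 1 - phi(z)
    elif tail == "lower":
        p_value = phi(z)
    else:
        p_value = 2 * (1 - phi(abs(z)))
    return z, p_value


# Example 10.2: m = 10, n = 11, sigma = .6 for both groups, lower-tailed test
z, p_value = two_sample_z(2.97, 3.06, 0.6, 0.6, 10, 11, tail="lower")
print(f"z = {z:.2f}, P-value = {p_value:.2f}")
```

The P-value is far above .05, which agrees with the conclusion in step 7 that the null hypothesis should not be rejected.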


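Power, β, and Sample Size Determination

Both power and β (the probability of a type II error) are easily calculated when the population distributions are normal with known values of $\sigma_1$ and $\sigma_2$. Consider the case in which the alternative hypothesis is $H_a\colon \mu_1 - \mu_2 > \Delta_0$. Let $\Delta'$ denote a value of $\mu_1 - \mu_2$ that exceeds $\Delta_0$, a value for which $H_0$ is false and $H_a$ is true. The upper-tailed rejection region $z \ge z_\alpha$ can be re-expressed in the form $\bar{x} - \bar{y} \ge \Delta_0 + z_\alpha \sigma_{\bar{X}-\bar{Y}}$. Thus the probability of a type II error when $\mu_1 - \mu_2 = \Delta'$ is

$\beta(\Delta') = P(\text{not rejecting } H_0 \text{ when } \mu_1 - \mu_2 = \Delta')$
$\phantom{\beta(\Delta')} = P(\bar{X} - \bar{Y} < \Delta_0 + z_\alpha \sigma_{\bar{X}-\bar{Y}} \text{ when } \mu_1 - \mu_2 = \Delta')$

When $\mu_1 - \mu_2 = \Delta'$, $\bar{X} - \bar{Y}$ is normally distributed with mean value $\Delta'$ and standard deviation $\sigma_{\bar{X}-\bar{Y}}$ (the same standard deviation as when $H_0$ is true); using these values to standardize the inequality in parentheses gives β.

Alternative Hypothesis                       $\beta(\Delta') = P(\text{type II error when } \mu_1 - \mu_2 = \Delta')$

$H_a\colon \mu_1 - \mu_2 > \Delta_0$         $\Phi\left(z_\alpha - \dfrac{\Delta' - \Delta_0}{\sigma_{\bar{X}-\bar{Y}}}\right)$

$H_a\colon \mu_1 - \mu_2 < \Delta_0$         $1 - \Phi\left(-z_\alpha - \dfrac{\Delta' - \Delta_0}{\sigma_{\bar{X}-\bar{Y}}}\right)$

$H_a\colon \mu_1 - \mu_2 \neq \Delta_0$      $\Phi\left(z_{\alpha/2} - \dfrac{\Delta' - \Delta_0}{\sigma_{\bar{X}-\bar{Y}}}\right) - \Phi\left(-z_{\alpha/2} - \dfrac{\Delta' - \Delta_0}{\sigma_{\bar{X}-\bar{Y}}}\right)$

where $\sigma_{\bar{X}-\bar{Y}} = \sqrt{(\sigma_1^2/m) + (\sigma_2^2/n)}$. For each case, power $= 1 - \beta(\Delta')$.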


Example 10.3 (Example 10.2 continued) If $\mu_1 = \mu_2 - .5$ (true average GPA is .5 lower for the less-studious group), what is the probability of detecting such a departure from $H_0$ based on a level .05 test with sample sizes m = 10 and n = 11? The value of $\sigma_{\bar{X}-\bar{Y}}$ for these sample sizes (the denominator of Z) was previously calculated as .262. The probability of a type II error for the lower-tailed level .05 test when $\mu_1 - \mu_2 = \Delta' = -.5$ is

$\beta(-.5) = 1 - \Phi\left(-1.645 - \dfrac{-.5 - 0}{.262}\right) = 1 - \Phi(0.263) = .396$

Thus the probability of detecting such a departure is power $= 1 - \beta(-.5) = .604$. Clearly, we have only a mediocre chance of detecting a difference of −.5 with these sample sizes. Perhaps we should not conclude from Example 10.2 that there is no relationship between study time and GPA, because the sample sizes were insufficient. ■
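The three β formulas in the table preceding Example 10.3 translate directly into code. The sketch below is one possible implementation using the standard library; the function name `beta_two_sample_z` and its argument names are assumptions made for illustration. Because it uses unrounded values of $z_\alpha$ and $\sigma_{\bar{X}-\bar{Y}}$, its result can differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def beta_two_sample_z(delta_prime, sigma1, sigma2, m, n,
                      alpha=0.05, delta0=0.0, tail="upper"):
    """Type II error probability of the two-sample z test at mu1 - mu2 = delta_prime."""
    nd = NormalDist()
    se = sqrt(sigma1**2 / m + sigma2**2 / n)
    shift = (delta_prime - delta0) / se        # standardized departure from H0
    if tail == "upper":
        z_a = nd.inv_cdf(1 - alpha)
        return nd.cdf(z_a - shift)
    if tail == "lower":
        z_a = nd.inv_cdf(1 - alpha)
        return 1 - nd.cdf(-z_a - shift)
    z_half = nd.inv_cdf(1 - alpha / 2)         # two-tailed case
    return nd.cdf(z_half - shift) - nd.cdf(-z_half - shift)


# Example 10.3: lower-tailed level .05 test, Delta' = -.5, m = 10, n = 11
beta = beta_two_sample_z(-0.5, 0.6, 0.6, 10, 11, alpha=0.05, tail="lower")
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Increasing `m` and `n` in the call shows how quickly the power rises for the same departure of −.5.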


As in Chapter 9, sample sizes m and n can be determined that will satisfy both P(type I error) = a specified α and P(type II error when $\mu_1 - \mu_2 = \Delta'$) = a specified β. For an upper-tailed test, equating the previous expression for $\beta(\Delta')$ to the specified value of β gives

$\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n} = \dfrac{(\Delta' - \Delta_0)^2}{(z_\alpha + z_\beta)^2}$

When the two sample sizes are equal, this equation yields

$m = n = \dfrac{(\sigma_1^2 + \sigma_2^2)(z_\alpha + z_\beta)^2}{(\Delta' - \Delta_0)^2}$

This expression is also correct for a lower-tailed test, whereas α is replaced by α/2 for a two-tailed test.
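As a rough illustration of the equal-sample-size formula, the sketch below computes the required common sample size and rounds up to the next integer. The function name `equal_sample_size` is hypothetical, and the scenario (the GPA setting with $\sigma_1 = \sigma_2 = .6$, $\Delta' = -.5$, α = .05, and a target of β = .10) is an assumed example rather than one worked in the text.

```python
from math import ceil
from statistics import NormalDist


def equal_sample_size(delta_prime, sigma1, sigma2, alpha, beta,
                      delta0=0.0, two_tailed=False):
    """Common sample size m = n giving type II error beta at mu1 - mu2 = delta_prime."""
    nd = NormalDist()
    # For a two-tailed test, alpha is replaced by alpha/2 (an approximation).
    z_alpha = nd.inv_cdf(1 - (alpha / 2 if two_tailed else alpha))
    z_beta = nd.inv_cdf(1 - beta)
    n_exact = (sigma1**2 + sigma2**2) * (z_alpha + z_beta)**2 / (delta_prime - delta0)**2
    return ceil(n_exact)  # round up so that beta is at most the target


# Assumed scenario: one-sided level .05 test, beta = .10 when mu1 - mu2 = -.5
n_required = equal_sample_size(-0.5, 0.6, 0.6, alpha=0.05, beta=0.10)
print(n_required)
```

Under these assumptions roughly 25 students per group would be needed, compared with the 10 and 11 actually available in Example 10.2.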


Using a Comparison to Identify Causality 

Investigators are often interested in comparing either the effects of two different treatments on some 
outcome, or the response after treatment with the response after no treatment (treatment vs. control). If 
the individuals or objects to be used in the comparison are not assigned by the investigators to the two 
different conditions, the study is said to be observational. The difficulty with drawing conclusions 
based on an observational study is that although statistical analysis may indicate a significant dif- 
ference in response between the two groups, the difference may be due to some underlying factors 
that had not been controlled, rather than to any difference in the effects of the treatments. 


Example 10.4 Many investigations in the last several years have explored the potential benefits of 
consuming a moderate amount of alcohol. As reported by CNN (July 27, 2017), a Danish study found 
that light-to-moderate drinkers (those consuming a few glasses of wine 3-4 days per week) had a 
lower risk of developing diabetes than those who rarely consumed alcohol. The study was based on 
tracking more than 70,000 Danes over a five-year period, and a test of the difference between the new 
diabetes development rates between these two groups resulted in an extremely low P-value. The 
difference was also considered clinically meaningful; that is, the low P-value was not simply the 
result of the very large sample size. 

Should we conclude that moderate alcohol consumption causes a decreased likelihood of diabetes? 
Should health professionals recommend a few glasses of wine per day to help prevent diabetes onset? 
Not necessarily: since individuals in the study decided for themselves how much to drink, there could 
be some other underlying factor that the moderate drinkers have in common that would explain their 
lower risk of diabetes. For instance, most drinkers in the study specifically consumed wine, and wine- 
drinkers tend to be wealthier. Perhaps other lifestyle features of those wealthier individuals can help 
explain the observed relationship. (Using advanced statistical methods, the researchers “adjusted” for several factors including age, diet, and education, but they can’t account for every possible alternative explanation.) ■


Once upon a time, it was argued that the studies linking smoking and lung cancer were all obser- 
vational, and therefore that nothing had been proved. This was the view of the great statistician 
R. A. Fisher, who maintained till his death in 1962 that the observational studies did not show causation. 
He said that people who choose to smoke might be more susceptible to lung cancer. This explanation for 
the relationship had plenty of opposition then, and few would support it now. At that time few women got 
lung cancer because few women smoked, but when smoking increased among women, so did lung 
cancer. Furthermore, the incidence of lung cancer was higher for those who smoked more, and quitters 
had reduced incidence. Eventually, the physiological effects on the body were better understood, and 
nonobservational animal studies made it clear that smoking does, in fact, cause lung cancer. 

To establish causation through a statistical study, we must try to eliminate the possibility that the 
groups being compared (e.g., drinkers and nondrinkers) have some other distinguishing feature (e.g., 
wealth) that could explain the study results. A randomized controlled experiment results when 
investigators assign subjects to the two treatments in a random fashion. When statistical significance 
is observed in such an experiment, the investigator and other interested parties will have more 
confidence in the conclusion that the difference in response has been caused by a difference in 
treatments. 
Example 10.5 Many advertisers have touted “green labeling” (discussing the environmental impact of a product in advertisements, e.g., “now with less phosphates”) as a way to increase sales.
But is this really effective? The article “How Green Should You Be: Can Environmental Associations 
Enhance Brand Performance?” (J. Mark. Res. 2008: 547-563) discussed a randomized controlled 
experiment in which shoppers were shown one of three brochures describing a (made-up) brand of 
detergent: one brochure that provided generic information about the brand, one that included infor- 
mation on the environmental performance of the brand, and one that additionally included an 
“environmental certification” label. Brochure types were randomly assigned to the study participants. 
The authors of the article then assessed participants’ attitude toward the brand, including the likeli- 
hood of purchasing that detergent. 

What the researchers found contradicted conventional wisdom: shoppers who saw the “green” 
advertisements were no more positively disposed to the product than those who had seen the generic 
advertisement. In other words, the presence of environmental information did not cause people to be 
more apt to purchase that brand of detergent. (To be more precise, the experiment uncovered no 
statistically significant differences in customer attitudes between the three brochures.) ■


Observational studies, experiments, and the issue of establishing causality are discussed at greater 
length in the (nonmathematical) books by Utts, Moore, and Freedman et al., listed in the 
bibliography. 


Exercises: Section 10.1 (1-12) 


1. An article in Consumer Reports compared various types of batteries. The average lifetimes of Duracell AA batteries and Energizer AA batteries were given as 4.1 h and 4.5 h, respectively. Suppose these are the population average lifetimes.

   a. Let $\bar{X}$ be the sample average lifetime of 100 Duracell batteries and $\bar{Y}$ be the sample average lifetime of 100 Energizer batteries. What is the mean value of $\bar{X} - \bar{Y}$ (i.e., where is the distribution of $\bar{X} - \bar{Y}$ centered)? How does your answer depend on the specified sample sizes?

   b. Suppose the population standard deviations of lifetime are 1.8 h for Duracell batteries and 2.0 h for Eveready batteries. With the sample sizes given in part (a), what is the variance of the statistic $\bar{X} - \bar{Y}$, and what is its standard deviation?

   c. For the sample sizes given in part (a), draw a picture of the approximate distribution curve of $\bar{X} - \bar{Y}$ (include a measurement scale on the horizontal axis). Would the shape of the curve necessarily be the same for sample sizes of 10 batteries of each type? Explain.

2. According to a 2018 report by the CDC, the mean body mass index (BMI) for American adult men is 29.1 kg/m², while the mean for women is 29.6 kg/m². Suppose these are population averages.

   a. Let $\bar{X}$ be the sample average BMI of 50 randomly selected American adult men and $\bar{Y}$ be the sample average BMI of 75 randomly selected American adult women. What is the expected value of $\bar{X} - \bar{Y}$? How does your answer depend on the specified sample sizes?

   b. Suppose the population standard deviations of BMI are 4.7 for men and 6.2 for women (these values are consistent with the study). With the sample sizes given in part (a), what is the variance of the statistic $\bar{X} - \bar{Y}$, and what is its standard deviation?

   c. For the sample sizes given in part (a), what is the approximate distribution of $\bar{X} - \bar{Y}$, and why?

   d. Would the shape of the distribution in part (c) necessarily be the same for sample sizes of 5 men and 7 women? Explain.

3. Let $\mu_1$ and $\mu_2$ denote true average tread lives (miles) for two competing brands of size P205/65R15 tires.

   a. Test $H_0\colon \mu_1 - \mu_2 = 0$ versus $H_a\colon \mu_1 - \mu_2 \neq 0$ at level .05 using the following information: $m = 45$, $\bar{x} = 42{,}500$, $\sigma_1 = 2200$, $n = 45$, $\bar{y} = 40{,}400$, and $\sigma_2 = 1900$.

   b. Use the information in part (a) to compute a 95% CI for $\mu_1 - \mu_2$. Does the resulting interval suggest that $\mu_1 - \mu_2$ has been precisely estimated?

4. Let $\mu_1$ denote true average tread life for a premium brand of P205/65R15 tire and let $\mu_2$ denote the true average tread life for an economy brand of the same size.

   a. Test $H_0\colon \mu_1 - \mu_2 = 5000$ versus the alternative $H_a\colon \mu_1 - \mu_2 > 5000$ at level .01 using the following information: $m = 45$, $\bar{x} = 42{,}500$, $\sigma_1 = 2200$, $n = 45$, $\bar{y} = 36{,}800$, and $\sigma_2 = 1500$.

   b. Use the information in part (a) to compute a 99% lower confidence bound for $\mu_1 - \mu_2$. Is your answer consistent with the test in part (a)?

5. Persons having Raynaud’s syndrome are apt to suffer a sudden impairment of blood circulation in fingers and toes. In an experiment to study the extent of this impairment, each subject immersed a forefinger in water and the resulting heat output (cal/cm²/min) was measured. For m = 10 subjects with the syndrome, the average heat output was $\bar{x} = .64$, and for n = 10 nonsufferers, the average output was 2.05. Let $\mu_1$ and $\mu_2$ denote the true average heat outputs for the two types of subjects. Assume that the two distributions of heat output are normal with $\sigma_1 = .2$ and $\sigma_2 = .4$.

   a. Consider testing $H_0\colon \mu_1 - \mu_2 = -1.0$ versus $H_a\colon \mu_1 - \mu_2 < -1.0$ at level .01. Describe in words what $H_a$ says, and then carry out the test.

   b. Compute the P-value for the value of Z obtained in part (a).

   c. What is the probability of a type II error when the actual difference between $\mu_1$ and $\mu_2$ is $\mu_1 - \mu_2 = -1.2$? What is the power?

   d. Assuming that m = n, what sample sizes are required to ensure that β = .1 when $\mu_1 - \mu_2 = -1.2$?

6. An experiment to compare the tension bond strength of polymer latex modified mortar to that of unmodified mortar resulted in $\bar{x} = 18.12$ kgf/cm² for the modified mortar (m = 40) and $\bar{y} = 16.87$ kgf/cm² for the unmodified mortar (n = 32). Let $\mu_1$ and $\mu_2$ be the true average tension bond strengths for the modified and unmodified mortars, respectively. Assume that the bond strength distributions are both normal.

   a. Assuming that $\sigma_1 = 1.6$ and $\sigma_2 = 1.4$, test $H_0\colon \mu_1 - \mu_2 = 0$ versus $H_a\colon \mu_1 - \mu_2 > 0$ at level .01.

   b. Compute the probability of a type II error for the test of part (a) when $\mu_1 - \mu_2 = 1$.

   c. Suppose the investigator decided to use a level .05 test and wished β = .10 when $\mu_1 - \mu_2 = 1$. If m = 40, what value of n is necessary?

7. What affects the time a consumer spends looking at a product on the shelf prior to selection? The following data summarized elapsed time (in seconds) for purchasers of fabric softener and washing-up liquid; the former is much more expensive than the latter. These products were chosen because they’re similar with respect to shelf space and number of brands available.

   Product              Sample size    Sample mean
   Fabric softener      15             30.42
   Washing-up liquid    19             26.53

   a. What assumptions, if any, are necessary for the inferential procedures of this section to be valid in this situation? Why?

   b. Assuming that $\sigma_1 = \sigma_2 = 8.5$ s, test to see if there is a significant difference in the true average time purchasers spend looking at these two products, at the α = .01 significance level.

8. An experiment was performed to compare the fracture toughness of high-purity nickel maraging steel with commercial-purity steel of the same type. The sample average toughness was $\bar{x} = 65.6$ for m = 32 specimens of the high-purity steel, whereas for n = 38 specimens of commercial steel $\bar{y} = 59.8$. Because the high-purity steel is more expensive, its use for a certain application can be justified only if its fracture toughness exceeds that of commercial-purity steel by more than 5. Suppose that both toughness distributions are normal.

   a. Assuming that $\sigma_1 = 1.2$ and $\sigma_2 = 1.1$, test the relevant hypotheses using α = .001.

   b. Compute β and power for the test conducted in part (a) when $\mu_1 - \mu_2 = 6$.

9. A study seeks to compare hospitals based on the performance of their intensive care units. The response variable is the mortality ratio, the ratio of the number of deaths over the predicted number of deaths based on the condition of the patients. The comparison will be between hospitals with nurse staffing problems and hospitals without such problems. Assume, based on past experience, that the standard deviation of the mortality ratio will be around .2 in both types of hospital. How many of each type of hospital should be included in the study in order to have both the type I and type II error probabilities be .05, if the true difference of mean mortality ratio for the two types of hospital is .2? If we conclude that hospitals with nurse staffing problems have a higher mortality ratio, does this imply a causal relationship? Explain.

10. To decide whether chemistry or physics majors have higher starting salaries in industry, n B.S. graduates of each type are surveyed, yielding $\bar{x} = \$61{,}500$ for chemistry and $\bar{y} = \$61{,}000$ for physics. Assume $\sigma = \$2500$ for both populations. Calculate the P-value for the appropriate two-sample z test, assuming that the data was based on n = 100. Then repeat the calculation for n = 400. Is the small P-value for n = 400 indicative of a difference that has practical significance? Would you have been satisfied with just a report of the P-value? Comment briefly.

11. a. Show for the upper-tailed test with $\sigma_1$ and $\sigma_2$ known that as either m or n increases, β decreases when $\mu_1 - \mu_2 > \Delta_0$.

    b. For the case of equal sample sizes (m = n) and fixed α, what happens to the necessary sample size n as β is decreased, where β is the desired type II error probability at a fixed alternative?

12. The level of monoamine oxidase (MAO) activity in blood platelets (nm/mg protein/h) was determined for each individual in a sample of 43 chronic schizophrenics, resulting in $\bar{x} = 2.69$, as well as for 45 normal subjects, resulting in $\bar{y} = 6.35$. Assume that $\sigma_1 = 2.3$ and $\sigma_2 = 4.0$. Does this data strongly suggest that true average MAO activity for normal subjects is more than twice the activity level for schizophrenics? Derive a test procedure and carry out the test using α = .01. [Hint: Let $\mu_1$ and $\mu_2$ refer to true average MAO activity for schizophrenics and normal subjects, respectively, and consider the parameter $\theta = 2\mu_1 - \mu_2$. Write $H_0$ and $H_a$ in terms of θ, estimate θ, and derive $\sigma_{\hat{\theta}}$.]

10.2. The Two-Sample t Confidence Interval and Test


In the previous section, we illustrated the use of a CI and test procedure for the difference of two 
means under the assumptions of normally distributed populations with known standard deviations. 
For large samples, the CLT allows us to use these methods even when the two populations of interest 
are not normal. 

In practice, though, it is virtually always the case that the values of the population standard deviations are unknown. We now proceed by extending the one-sample t procedures from Chapters 8 and 9 to the analysis of a difference of means. Such inferential methods still assume normal population distributions, though (as discussed below) that assumption can be relaxed for large sample sizes.

We continue under the assumptions 1-3 stated at the beginning of Section 10.1. Since it is no longer assumed that the population standard deviations $\sigma_1$ and $\sigma_2$ are known, they will be replaced in Expression (10.1) by the sample standard deviations $S_1$ and $S_2$, respectively. The following theorem stems from a result first presented by B. L. Welch in 1938.


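WELCH’S THEOREM When the population distributions are both normal, the standardized variable

$T = \dfrac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n}}}$     (10.2)

has approximately a t distribution with df ν estimated from the data by

$\nu = \dfrac{\left(\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}\right)^2}{\dfrac{(s_1^2/m)^2}{m-1} + \dfrac{(s_2^2/n)^2}{n-1}} = \dfrac{\left(se_1^2 + se_2^2\right)^2}{\dfrac{se_1^4}{m-1} + \dfrac{se_2^4}{n-1}}$     (10.3)

where $se_1 = s_1/\sqrt{m}$ and $se_2 = s_2/\sqrt{n}$ (round ν down to the nearest integer).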


The cumbersome Expression (10.3) is called Welch’s degrees of freedom (or Welch-Satterthwaite, after another statistician researching this problem around the same time). Of course, statistical software packages have (10.3) built in. Manipulating T from (10.2) in a probability statement to isolate $\mu_1 - \mu_2$ gives a CI, whereas a test statistic results from replacing $\mu_1 - \mu_2$ by the null value $\Delta_0$.


TWO-SAMPLE t PROCEDURES

The two-sample t confidence interval for $\mu_1 - \mu_2$ with approximate confidence level 100(1 − α)% is

$\bar{x} - \bar{y} \pm t_{\alpha/2,\nu}\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}$

where ν = Welch’s df formula (10.3). One-sided confidence bounds can be calculated by retaining the appropriate sign (+ or −) and replacing $t_{\alpha/2,\nu}$ by $t_{\alpha,\nu}$.

The two-sample t test for testing $H_0\colon \mu_1 - \mu_2 = \Delta_0$ is as follows:

Alternative Hypothesis                       Rejection Region for Approximate Level α Test
$H_a\colon \mu_1 - \mu_2 > \Delta_0$         $t \ge t_{\alpha,\nu}$ (upper-tailed test)
$H_a\colon \mu_1 - \mu_2 < \Delta_0$         $t \le -t_{\alpha,\nu}$ (lower-tailed test)
$H_a\colon \mu_1 - \mu_2 \neq \Delta_0$      either $t \ge t_{\alpha/2,\nu}$ or $t \le -t_{\alpha/2,\nu}$ (two-tailed test)

Test statistic value: $t = \dfrac{\bar{x} - \bar{y} - \Delta_0}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}}$

A P-value can be computed as described in Section 9.4 for the one-sample t test.


Example 10.6 Which way of dispensing champagne, the traditional vertical method or a tilted “beer-like” pour, preserves more of the tiny gas bubbles that improve flavor and aroma? The following data was reported in the article “On the Losses of Dissolved CO2 during Champagne Serving” (J. Agr. Food Chem. 2010: 8768-8775).

Temperature (°C)    Type of pour    n    Mean (g/L)    SD
18                  Traditional     4    4.0           .5
18                  Slanted         4    3.7           .3
12                  Traditional     4    3.3           .2
12                  Slanted         4    2.0           .3

Assuming that the sampled distributions are normal, let’s calculate confidence intervals for the difference between true average dissolved CO2 loss for the traditional pour and that for the slanted pour at each of the two temperatures.

For the 18°C temperature, Welch’s df is

$\nu = \dfrac{\left(\dfrac{.5^2}{4} + \dfrac{.3^2}{4}\right)^2}{\dfrac{(.5^2/4)^2}{3} + \dfrac{(.3^2/4)^2}{3}} = \dfrac{.007225}{.00147083} = 4.91$

Rounding down, the CI will be based on 4 df. For a confidence level of (approximately) 99%, we need $t_{.005,4} = 4.604$. The desired interval is

$4.0 - 3.7 \pm (4.604)\sqrt{\dfrac{.5^2}{4} + \dfrac{.3^2}{4}} = .3 \pm (4.604)(.2915) = .3 \pm 1.3 = (-1.0, 1.6)$

Thus we can be highly confident that $-1.0 < \mu_1 - \mu_2 < 1.6$, where $\mu_1$ and $\mu_2$ are true average losses for the traditional and slant methods, respectively. Notice that this CI contains 0, so at the 99% confidence level, it is plausible that $\mu_1 - \mu_2 = 0$, that is, that $\mu_1 = \mu_2$.

The df formula for the 12°C comparison yields df = .00105625/.00020208 = 5.23, necessitating the use of $t_{.005,5} = 4.032$ for a 99% CI. The resulting interval is (.6, 2.0). Thus 0 is not a plausible value for this difference. It appears from the CI that the true average loss when the slant method is used is smaller than that when the traditional method is used, so that the slant method is better at this temperature. This in fact was the conclusion reported in the popular media. ■
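The Welch df and confidence interval calculations in Example 10.6 can be reproduced with a few lines of code. This is a minimal sketch with hypothetical helper names (`welch_df`, `welch_ci`); the standard deviations .5 and .3 at 18°C are those used in the df calculation, and the t critical values 4.604 and 4.032 are taken from the example rather than computed, since the Python standard library has no t quantile function.

```python
from math import floor, sqrt


def welch_df(s1, m, s2, n):
    """Unrounded Welch degrees of freedom, Expression (10.3)."""
    v1, v2 = s1**2 / m, s2**2 / n          # squared standard errors
    return (v1 + v2)**2 / (v1**2 / (m - 1) + v2**2 / (n - 1))


def welch_ci(xbar, ybar, s1, m, s2, n, t_crit):
    """Two-sample t interval for mu1 - mu2, given a t critical value."""
    half_width = t_crit * sqrt(s1**2 / m + s2**2 / n)
    diff = xbar - ybar
    return diff - half_width, diff + half_width


# 18 degrees C: traditional (4.0, SD .5) versus slanted (3.7, SD .3), n = 4 each
df18 = welch_df(0.5, 4, 0.3, 4)
ci18 = welch_ci(4.0, 3.7, 0.5, 4, 0.3, 4, t_crit=4.604)   # t_{.005,4} from the text

# 12 degrees C: traditional (3.3, SD .2) versus slanted (2.0, SD .3), n = 4 each
df12 = welch_df(0.2, 4, 0.3, 4)
ci12 = welch_ci(3.3, 2.0, 0.2, 4, 0.3, 4, t_crit=4.032)   # t_{.005,5} from the text

print(floor(df18), [round(v, 1) for v in ci18])
print(floor(df12), [round(v, 1) for v in ci12])
```

Rounding the df down before looking up the critical value, as the theorem prescribes, makes the interval slightly wider than it would be with the fractional df.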
Example 10.7 What color should you use for your Web site’s background? The authors of “Waiting for the Web: How Screen Color Affects Time Perception” (J. Mark. Res. 2004) compared subjects’ time perception based on the background color of a Web site being downloaded. Subjects were randomly assigned to see a blue background or a yellow background; the Web sites were otherwise identical, including the actual download time. Data consistent with the information in the article appears in Figure 10.1. Values of summary statistics appear in Table 10.1.

Figure 10.1 Comparative boxplot for Example 10.7 (perceived quickness rating by Website background color)

Table 10.1 Values of summary statistics for Example 10.7

Perceived quickness rating
           n     Mean    SD
Blue       25    3.67    1.07
Yellow     24    3.04    1.07

Let’s test to see whether background color affects users’ average perception of download time, at the 5% level. (A higher “perceived quickness” rating indicates the subject thought the page downloaded faster.)

1. The parameters of interest are
   $\mu_1$ = true mean perceived quickness rating with a blue background
   $\mu_2$ = true mean perceived quickness rating with a yellow background

2. $H_0\colon \mu_1 - \mu_2 = 0$
   $H_a\colon \mu_1 - \mu_2 \neq 0$

3. Subjects were randomly assigned to blue or yellow background color, so it is reasonable to treat the groups’ responses as independent. Normal probability plots of data consistent with the article appear in Figure 10.2; the patterns in both plots are reasonably linear, so neither one suggests a marked deviation from normality. (We can be a little forgiving about some curvature here, since m = 25 and n = 24 are not too small.) It is therefore valid to use the two-sample t test for this analysis.
Figure 10.2 Normal probability plots for Example 10.7

4. The null value is $\Delta_0 = 0$, so the test statistic value is

   $t = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}}$

5. Welch’s df formula (10.3) gives ν = 46.9, which we round down to 46 df. From software, $t_{.025,46} = 2.013$, so we will reject $H_0$ if either $t \ge 2.013$ or $t \le -2.013$.
6. Using the values in Table 10.1,

   $t = \dfrac{(3.67 - 3.04) - 0}{\sqrt{\dfrac{(1.07)^2}{25} + \dfrac{(1.07)^2}{24}}} = 2.06$

7. Since 2.06 > 2.013, using a significance level of .05 we can (barely) reject the null hypothesis in favor of the alternative hypothesis, confirming the conclusion stated in the article: users’ perceptions of the speed at which a Web site downloads differ depending on whether the background color is blue or yellow. However, someone demanding more compelling evidence might select α = .01, a level for which $H_0$ cannot be rejected.

Using the P-value approach, for this two-tailed test and with the aid of software,

P-value $\approx 2 \cdot P(T \ge 2.06 \text{ when } T \sim t_{46}) = 2(.023) = .046$

Because .046 < .05, $H_0$ would again barely be rejected at the α = .05 significance level (but not rejected at the .01 level, since .046 > .01).

This isn’t the whole story: the same study also measured the sense of relaxation users felt when 
viewing the Web sites. They found that an increased sense of relaxation associated with the color blue 
accounted for subjects’ higher average perceived quickness. Blue backgrounds don’t make down- 
loads seem quicker per se; they relax the user more and make download time less noticeable. 
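The test in Example 10.7 can be sketched in code from the summary statistics in Table 10.1. Since the Python standard library has no t distribution, the hypothetical helper `t_upper_tail` below approximates the tail area by integrating the t density numerically with Simpson's rule; this is an illustrative substitute for the statistical software mentioned in the example, not the method the authors used.

```python
from math import exp, floor, lgamma, pi, sqrt


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 by Simpson's rule on the t density over [0, t]."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)

    def density(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps                            # steps must be even for Simpson's rule
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 0.5 - total * h / 3               # half the area lies to the right of 0


# Summary statistics from Table 10.1 (blue versus yellow background)
m, xbar, s1 = 25, 3.67, 1.07
n, ybar, s2 = 24, 3.04, 1.07

v1, v2 = s1**2 / m, s2**2 / n
t_stat = (xbar - ybar - 0) / sqrt(v1 + v2)
df = floor((v1 + v2)**2 / (v1**2 / (m - 1) + v2**2 / (n - 1)))  # Welch df, rounded down
p_two_tailed = 2 * t_upper_tail(abs(t_stat), df)

print(f"t = {t_stat:.2f}, df = {df}, P-value = {p_two_tailed:.3f}")
```

The resulting P-value falls between .04 and .05, consistent with rejecting the null hypothesis at level .05 but not at level .01.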
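Motivation for Welch’s Theorem
Dividing numerator and denominator of (10.2) by the standard deviation of the numerator gives

$T = \dfrac{\left[\bar{X} - \bar{Y} - (\mu_1 - \mu_2)\right] \Big/ \sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}}{\sqrt{\left(\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n}\right) \Big/ \left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)}}$

The numerator of this ratio is a standard normal rv because it results from standardizing the normally distributed difference $\bar{X} - \bar{Y}$. The denominator is independent of the numerator because the sample variances are independent of the sample means for normal samples. However, in order for (10.2) to be a t random variable, the denominator needs to be the square root of a chi-squared rv divided by its df, and unfortunately this is not the case. So let us try to express the denominator at least approximately as $\sqrt{W/\nu}$ with $W \sim \chi^2_\nu$, yielding

$\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n} \approx \left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)\dfrac{W}{\nu}$

To determine ν we equate the means and variances of both sides, with the help of $E(W) = \nu$, $V(W) = 2\nu$, $(m-1)S_1^2/\sigma_1^2 \sim \chi^2_{m-1}$, and $(n-1)S_2^2/\sigma_2^2 \sim \chi^2_{n-1}$ from Sections 6.3 and 6.4. It follows that $E(S_1^2) = \sigma_1^2$, $V(S_1^2) = 2\sigma_1^4/(m-1)$, and similarly for $S_2^2$. The mean of the left-hand side is

$E\left(\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n}\right) = \dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}$

which is also the mean of the right-hand side, so the means are equal. The variance of the left-hand side is

$V\left(\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n}\right) = \dfrac{2\sigma_1^4}{(m-1)m^2} + \dfrac{2\sigma_2^4}{(n-1)n^2}$

and the variance of the right-hand side is

$V\left[\left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)\dfrac{W}{\nu}\right] = \left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)^2 \dfrac{V(W)}{\nu^2} = \left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)^2 \dfrac{2}{\nu}$

Now equate these two variances, substituting sample variances for the unknown population variances, and solve for ν. This gives Expression (10.3) in Welch’s Theorem.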
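For readers who want the final algebraic step spelled out, the following block sketches it under the same notation. Setting the variance of the left-hand side equal to the variance of the right-hand side and cancelling the common factor of 2 gives ν in terms of the population variances; replacing each $\sigma_i^2$ by $s_i^2$ then yields the estimate in (10.3).

```latex
\begin{align*}
\frac{2\sigma_1^4}{(m-1)m^2} + \frac{2\sigma_2^4}{(n-1)n^2}
  &= \left(\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\right)^{2}\frac{2}{\nu} \\[4pt]
\nu
  &= \frac{\left(\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}\right)^{2}}
          {\dfrac{(\sigma_1^2/m)^2}{m-1} + \dfrac{(\sigma_2^2/n)^2}{n-1}}
  \;\approx\;
     \frac{\left(\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}\right)^{2}}
          {\dfrac{(s_1^2/m)^2}{m-1} + \dfrac{(s_2^2/n)^2}{n-1}}
\end{align*}
```

The second line uses $\sigma_1^4/m^2 = (\sigma_1^2/m)^2$ and the analogous identity for the second sample.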


Large-Sample t Procedures
We have seen in previous chapters that t-based CIs and hypothesis tests can be applied to data from nonnormal populations provided that the sample sizes are sufficiently large. The same is true for the two-sample t procedures: Welch’s Theorem is still approximately correct even if the X’s and Y’s are not sampled from normal distributions, so long as both sample sizes are large enough. We’ll continue to use the convention that m > 40 and n > 40 qualify as “large” samples.

Also in parallel with previous chapters, for large samples there is little practical difference between using z or t critical values for inference. It can be shown (Exercise 97) that Welch’s df satisfies $\nu \ge \min(m - 1, n - 1)$, so that ν is large if both m and n are. In that situation, using z values for the procedures of this section (equivalently, substituting $s_1$ and $s_2$ for $\sigma_1$ and $\sigma_2$ in the two-sample z procedures of Section 10.1) will yield similar results to the two-sample t procedures.


Example 10.8 A study was carried out in an attempt to improve student performance in a low-level 
university mathematics course. Experience had shown that many students had fallen by the wayside, 
meaning that they had dropped out or completed the course with minimal effort and low grades. The 
study involved assigning the students to sections based on odd or even Social Security number. It is 
important that the assignment to sections not be on the basis of student choice, because then the 
differences in performance might be attributable to differences in student attitude or ability. Half of 
the sections were taught traditionally, whereas the other half were taught in a way that hopefully 
would keep the students involved. They were given frequent assignments that were collected and 
graded, they had frequent quizzes, and they were allowed retakes on exams. 

Prof. Lotus Hershberger conducted the experiment and he supplied the final exam scores, out of 40 
points possible, for the 79 students taught traditionally (the control group) and for the 85 students 
taught with more involvement (the experimental group). Table 10.2 summarizes the data. Does this information suggest that true mean for the experimental condition exceeds that for the control condition? Let’s use a test with α = .05.


Table 10.2 Summary results for Example 10.8 


Group Sample size Sample mean Sample SD 
Control 79 23.87 11.60 
Experimental 85 27.34 8.85 


Let $\mu_1$ and $\mu_2$ denote the true mean scores for the control condition and the experimental condition, respectively. The two hypotheses are $H_0\colon \mu_1 - \mu_2 = 0$ versus $H_a\colon \mu_1 - \mu_2 < 0$. Welch’s degrees of freedom ν = 145 here; since the $t_{145}$ and z curves are virtually indistinguishable, we’ll use the latter here. $H_0$ will be rejected if $z \le -z_{.05} = -1.645$. Then

$z = \dfrac{(23.87 - 27.34) - 0}{\sqrt{\dfrac{11.60^2}{79} + \dfrac{8.85^2}{85}}} = \dfrac{-3.47}{1.620} = -2.14$

Since −2.14 < −1.645, $H_0$ is rejected at significance level .05. Alternatively, the P-value for a lower-tailed z test is

P-value $= \Phi(z) = \Phi(-2.14) = .016$

which implies rejection at significance level .05.

We have shown fairly conclusively that the experimental method of instruction is an improvement. 
Nevertheless, there is more to be said. It is important to view the data graphically to see if there is 
anything strange. Figure 10.3 combines a boxplot and dotplot. 

The plot shows that both groups have outlying observations at the low end; some students showed 
up for the final but performed very poorly. What happens if we compare the groups while ignoring the 
low performers whose scores are below 10? The resulting summary information is in Table 10.3. 
Figure 10.3 Boxplot/dotplot for the teaching experiment


Table 10.3 Summary results without poor performers 


Group Sample size Sample mean Sample SD 
Control 61 29.59 5.005 
Experimental 76 29.88 4.950 


Notice that the means and standard deviations for the two groups are now very similar. Indeed, 
based on Table 10.3 the test statistic value is —.34, giving no reason to reject the null hypothesis. For 
the majority of the students, there appears to be not much effect from the experimental treatment. It is 
the low performers who make a big difference in the results. There were 18 low performers in the 
control group but only 9 in the experimental group. The effect of the experimental instruction is to 
decrease the number of students who perform at the bottom of the scale. This is in accord with the 
goals of the experimental treatment, which was designed to keep students on track. ■
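
As a quick check on the second comparison, the sketch below recomputes the test statistic from Table 10.3 and counts the students dropped from each group. It is an illustration only, written in Python with the standard library, and it again uses the z curve for the P-value.

```python
from math import sqrt
from statistics import NormalDist

# Table 10.3: scores below 10 removed (control first, experimental second)
m, mean1, sd1 = 61, 29.59, 5.005
n, mean2, sd2 = 76, 29.88, 4.950

se = sqrt(sd1**2 / m + sd2**2 / n)        # estimated standard error of the difference
z_trimmed = (mean1 - mean2) / se
p_trimmed = NormalDist().cdf(z_trimmed)   # lower-tailed P-value, as in the original test

# Students removed from each group, using the sample sizes in Table 10.2
dropped_control, dropped_experimental = 79 - m, 85 - n

print(round(z_trimmed, 2), round(p_trimmed, 2))   # -0.34 0.37
print(dropped_control, dropped_experimental)      # 18 9
```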


Pooled t Procedures 

Alternatives to the two-sample t procedures described in this section result from assuming not only that
the two population distributions are normal but also that they have equal, albeit unknown, standard
deviations (σ₁ = σ₂). That is, the two population distribution curves are assumed normal with equal
spreads, the only possible difference between them being where they are centered (i.e., at μ₁ and μ₂).
Let σ denote the common population standard deviation. Then standardizing X̄ − Ȳ gives

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma^2}{m} + \dfrac{\sigma^2}{n}}} = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$


which has a standard normal distribution. Before this variable can be used as a basis for making
inferences about μ₁ − μ₂, the common variance must be estimated from sample data. One estimator
of σ² is S₁², the variance of the m observations in the first sample, and another is S₂², the variance of the
second sample. Intuitively, a better estimator than either individual sample variance results from
combining the two sample variances. A first thought might be to use (S₁² + S₂²)/2, the ordinary
average of the two sample variances. However, if m > n then the first sample contains more
information about σ² than does the second sample, and an analogous comment applies if m < n. The
following weighted average of the two sample variances, called the pooled (i.e., combined) estimator
of σ², adjusts for any difference between the two sample sizes:

$$S_p^2 = \frac{m-1}{m+n-2}\,S_1^2 + \frac{n-1}{m+n-2}\,S_2^2$$


It can be shown (Exercise 39) that Sₚ² is proportional to a chi-squared rv with m + n − 2 df. In turn, the
rv that results if Sₚ² replaces σ² in the above Z statistic follows a t distribution with m + n − 2 df
(Exercise 40). In the same way that earlier standardized variables were used as a basis for deriving
confidence intervals and test procedures, this t variable immediately leads to the pooled t confidence
interval for estimating μ₁ − μ₂ and the pooled t test for testing hypotheses about a difference between
means. In particular, the pooled t test statistic for testing H₀: μ₁ − μ₂ = Δ₀ is

$$T_p = \frac{\bar{X} - \bar{Y} - \Delta_0}{\sqrt{\dfrac{S_p^2}{m} + \dfrac{S_p^2}{n}}} = \frac{\bar{X} - \bar{Y} - \Delta_0}{S_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

and Tₚ has a t distribution with m + n − 2 df when H₀ is true.
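
To make the pooled formulas concrete, here is a minimal Python sketch that computes the pooled variance estimate, the pooled t statistic, and its degrees of freedom from summary statistics. The function name pooled_t is our own, and the Table 10.3 summary values are used only because the two sample standard deviations there are nearly equal, so the equal-variance assumption is at least plausible for illustration.

```python
from math import sqrt

def pooled_t(mean1, sd1, m, mean2, sd2, n, delta0=0.0):
    """Pooled t statistic, its df, and the pooled variance from summary data.

    pooled_t is a hypothetical helper name used for illustration.
    """
    df = m + n - 2
    # Weighted average of the two sample variances (weights (m-1)/df and (n-1)/df)
    sp2 = ((m - 1) * sd1**2 + (n - 1) * sd2**2) / df
    se = sqrt(sp2 * (1 / m + 1 / n))
    return (mean1 - mean2 - delta0) / se, df, sp2

t_p, df_p, sp2 = pooled_t(29.59, 5.005, 61, 29.88, 4.950, 76)
print(round(t_p, 2), df_p, round(sp2, 2))   # -0.34 135 24.75
```

With nearly equal sample standard deviations, the pooled statistic is essentially the same as the unpooled value of −.34 reported for Table 10.3; the two statistics differ appreciably only when the sample variances and sample sizes are both unequal.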

In the past, many statisticians recommended these pooled t procedures over the two-sample
t procedures. The pooled t test, for example, can be derived from the likelihood ratio principle,
whereas the two-sample t test is not a likelihood ratio test. Furthermore, the significance level for the
pooled t test is exact, whereas it is only approximate for the two-sample t test. Finally, power and
sample size calculations using the pooled t procedure can easily be performed by software using a
noncentral t distribution (Exercise 123).

However, statistical research has shown that while the pooled t test does outperform the two-sample
t test by a bit (more power for the same α) when σ₁ = σ₂, the former test can easily lead to
erroneous conclusions if applied when the population standard deviations are different. Analogous
comments apply to the behavior of the two confidence intervals. That is, the pooled t procedures are
not robust to violations of their equal variance assumption.
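
The lack of robustness is easy to see in a small simulation. The sketch below is our own illustration, not a study cited in the text: it draws many pairs of normal samples with equal means but unequal standard deviations, with the smaller sample coming from the more variable population, and records how often each statistic exceeds 1.96 in absolute value. Using the z critical value 1.96 in place of exact t critical values is a simplifying assumption, so the unpooled rate is only roughly comparable to the nominal 5%.

```python
import random
from math import sqrt

def sample_var(data):
    """Sample variance with divisor len(data) - 1."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** 2 for v in data) / (len(data) - 1)

def rejection_rates(m, n, sd1, sd2, reps=2000, crit=1.96, seed=1):
    """Estimate how often each statistic exceeds crit in absolute value
    when H0: mu1 - mu2 = 0 is actually true (both means are 0)."""
    rng = random.Random(seed)
    pooled_hits = unpooled_hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd1) for _ in range(m)]
        y = [rng.gauss(0.0, sd2) for _ in range(n)]
        diff = sum(x) / m - sum(y) / n
        v1, v2 = sample_var(x), sample_var(y)
        sp2 = ((m - 1) * v1 + (n - 1) * v2) / (m + n - 2)
        if abs(diff) / sqrt(sp2 * (1 / m + 1 / n)) > crit:   # pooled statistic
            pooled_hits += 1
        if abs(diff) / sqrt(v1 / m + v2 / n) > crit:         # two-sample (unpooled) statistic
            unpooled_hits += 1
    return pooled_hits / reps, unpooled_hits / reps

# Small sample from the more variable population: the hard case for pooling
pooled_rate, unpooled_rate = rejection_rates(m=30, n=150, sd1=3.0, sd2=1.0)
print(pooled_rate, unpooled_rate)
```

In this configuration the pooled standard error badly underestimates the true one, so the pooled rejection rate should come out several times larger than the nominal level, while the unpooled rate should stay much closer to it.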

It has been suggested that one could carry out a preliminary test of H₀: σ₁ = σ₂ and use a pooled
t procedure if this null hypothesis is not rejected. Unfortunately, the usual “F test” of equal variances
(Section 10.5) is quite sensitive to the assumption of normal population distributions, much more so
than t procedures. We therefore recommend the conservative approach of using two-sample t procedures
unless there is really compelling evidence for doing otherwise, particularly when the two
sample sizes are different.


Power and Type II Error Probabilities 
Determining power and β for the two-sample t test is complicated. The most recent versions of R,
SAS, Minitab, and JMP will calculate power for the pooled t test (that is, assuming a common value
for σ₁ and σ₂) but not for the two-sample t test. However, Prof. Russell Lenth (Univ. of Iowa) has
developed a Java software package that performs such power calculations for the two-sample t test;
the package can be downloaded for free from his Web site. The software will also calculate sample
sizes necessary to obtain a specified power for a particular value of μ₁ − μ₂.

In general, power will increase (β will decrease) as the sample sizes increase, as α increases, and as
μ₁ − μ₂ moves farther from Δ₀. When m and n are both large, the quantity T in (10.2) also has an
approximately normal distribution, and so the power, β, and sample size formulas from Section 10.1
provide approximately correct values. Population sd’s in those formulas can be replaced by sample sd’s.
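
The large-sample approximation can be sketched as follows. The code assumes the standard normal-theory power formula for a lower-tailed test, power = Φ(−z_α − (Δ′ − Δ₀)/σ), where σ is the standard deviation of X̄ − Ȳ; we take this to be the form given in Section 10.1, which is not reproduced here. The function name and the alternative value Δ′ = −3 are hypothetical choices for illustration, with the sample sd’s of Table 10.2 plugged in for the population sd’s.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_lower(delta_alt, delta0, sd1, m, sd2, n, alpha=0.05):
    """Approximate power of a lower-tailed two-sample test when the true
    difference mu1 - mu2 equals delta_alt (large m and n assumed)."""
    std_normal = NormalDist()
    se = sqrt(sd1**2 / m + sd2**2 / n)        # sd of the difference in sample means
    z_alpha = std_normal.inv_cdf(1 - alpha)   # upper-tail critical value z_alpha
    return std_normal.cdf(-z_alpha - (delta_alt - delta0) / se)

# Power to detect a true difference of -3 points with the Table 10.2 design
power = approx_power_lower(-3.0, 0.0, 11.60, 79, 8.85, 85)
# Doubling both sample sizes should increase the power
power_doubled = approx_power_lower(-3.0, 0.0, 11.60, 158, 8.85, 170)
print(round(power, 2), round(power_doubled, 2))
```

Under these assumptions the power for the original sample sizes comes out a little under .6, which illustrates why a true difference of this size could easily go undetected.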


Exercises: Section 10.2 (13-42)


13. Determine the number of degrees of freedom for the two-sample t test or CI in each
of the following situations:


a. m = 10, n = 10, s₁ = 5.0, s₂ = 6.0
b. m = 10, n = 15, s₁ = 5.0, s₂ = 6.0
c. m = 10, n = 15, s₁ = 2.0, s₂ = 6.0
d. m = 12, n = 24, s₁ = 5.0, s₂ = 6.0


14. A 2008 study in the J. Family Econ. Issues compared the work and home habits of
female and male lawyers in Canada. All participants in the survey had at least one
child; single parents and couples who are both lawyers were excluded. Two of the
variables measured were the weekly number of hours spent in the office and the number of
hours spent with their children on weekdays.


                Weekly work hours    Weekday hours w/kids
          n     Mean     SD          Mean     SD
Mothers   230   41.38    11.90       3.27     1.68
Fathers   604   48.09    10.30       1.82     1.29

a. Estimate, with 95% confidence, the 
difference in the average weekly num- 
ber of work hours between mothers and 
fathers who practice law. 

b. Estimate, with 95% confidence, the 
difference in the average number of 
hours female and male lawyers spend 
with their kids on weekdays. Then, 
convert this interval into an estimate for 
the difference in average weekly hours 
spent with kids (Hint: The first interval 
is a daily average, and a work week has 
5 days). 


[Note: Interestingly, the study also found 
that “contrary to assumptions in the litera- 
ture and the workplace, mothers practicing 
law are significantly more committed to 
their careers than fathers.”]


15. The article “Return Migration, Investment in Children, and Intergenerational Mobility:
Comparing Sons of Foreign- and Native-Born Fathers” (J. Hum. Res. 2008: 299-324)
presented the following summary data on years of education both for a sample of sons of
native-born fathers in Germany and another sample of sons of foreign-born fathers.


               n     x̄      s
Foreign-born   251   9.2    1.9
Native-born    640   11.7   2.6


Does the true average years of education for sons of native-born fathers appear to exceed
that for those with foreign-born fathers? State and test the appropriate hypotheses
using a significance level of .01.

16. The accompanying time-to-repair (min)
data for both high rail and low rail breaks 
on curved track appeared in the article 
“Uncertainty Estimation in Railway Track 
Life-Cycle Cost” (J. Rail Rapid Transit 
2009: 285-293). (On a curved track, the 
high rail is the outer rail with the larger 
radius, while the low rail is the inner rail 
with the smaller radius.) 


High: 159 120 480 149 270 547 340 
43 228 202 240 218 

Low: 258 154 216 240 169 75 340 
202 202 216 


Normal probability plots of both samples 
show reasonably linear patterns. 


a. Construct a comparative boxplot and 
comment on interesting features. 

b. Carry out a test of hypotheses at sig- 
nificance level .10 to decide if there is 
evidence for concluding that true aver- 
age repair time for high rails exceeds 
that for low rails by more than 30 min. 

c. Obtain and interpret a confidence 
interval at the 90% confidence level for 
the difference between true average 
repair times for high and low rails. 


17. Due to recent concerns about player concussions, football helmets have recently
increased in both size and mass. Have these changes made a difference? The article
“The Effects of Helmet Weight on Hybrid III Head and Neck Responses by
Comparing Unhelmeted and Helmeted Impacts” (J. Biomech. Engr. 2016) reports
on an experiment in which repeated impact trials were performed on an artificial human
head both wearing and not wearing a football helmet.


a. The following summary information for the variable head acceleration (g) is
consistent with information in the article.


            Sample size   Mean   SD
Helmet      24            43.1   4.5
No helmet   24            75.4   12


Test whether the average head acceleration is reduced by helmet wear at the
.05 significance level.


b. The researchers were concerned that the mass of the helmet might increase the
force experienced by the upper neck. The following summary information for
resultant neck force (Newtons) is consistent with information in the article.


            Sample size   Mean   SD
Helmet      24            1331   93
No helmet   24            945    77


Test whether the average resultant neck force is increased by helmet wear at the
.05 level.

[Note: The authors conclude that “the increased neck forces provide a possible
explanation as to why there has not been a ... reduction in concussion rates
despite improvements in helmets’ ability to reduce head accelerations.”]


c. If the null hypotheses in (a) and (b) are 
in fact both true, what can be said about 
the chance that at least one type I error 
is committed by the two tests? 


18. The article “Evaluation of a Ventilation Strategy to Prevent Barotrauma in Patients
at High Risk for Acute Respiratory Distress Syndrome” (New Engl. J. Med. 1998:
355-358) reported on an experiment in which 120 patients with similar clinical
features were randomly divided into a control group and a treatment group, each
consisting of 60 patients. The sample mean ICU stay (days) and sample standard
deviation for the treatment group were 19.9 and 39.1, respectively, whereas these
values for the control group were 13.7 and 15.8.


a. Calculate a point estimate for the difference between true average ICU stay
for the treatment and control groups. Does this estimate suggest that there is a
significant difference between true average stays under the two conditions?

b. Answer the question posed in part (a) by carrying out a formal test of hypotheses.
Is the result different from what you conjectured in part (a)?

c. Does it appear that ICU stay for patients given the ventilation treatment is normally
distributed? Explain your reasoning.

d. Estimate true average length of stay for patients given the ventilation treatment
in a way that conveys information about precision and reliability.


19. What impact does fast-food consumption have on various dietary and health
characteristics? The article “Effects of Fast-Food Consumption on Energy Intake and Diet
Quality among Children in a National Household Study” (Pediatrics 2004: 112-118)
reported the accompanying summary data on daily calorie intake both for a sample
of teens who said they did not typically eat fast food and another sample of teens who
said they did usually eat fast food.


Eat fast food   Sample size   Sample mean   Sample SD
No              663           2258          1519
Yes             413           2637          1138


a. Estimate the difference between true 
average calorie intake for teens who typ- 
ically don’t eat fast foods and true average 
intake for those who do eat fast foods, and 
do so in a way that conveys information
about reliability and precision. 

b. Does this data provide strong evidence 
for concluding that true average calorie 
intake for teens who typically eat fast 
food exceeds true average intake for 
those who don’t typically eat fast food 
by more than 200 cal/day? Carry out a 
test at significance level .05 based on 
determining the P-value. 


20. Much research has focused on comparing 


business environment cultures across several 
countries. The article “Perception of Internal 
Factors for Corporate Entrepreneurship: A 
Comparison of Canadian and U.S. Man- 
agers” (Entrep. Theory Pract. 1999: 9-24) 
presented the following summary data on 
hours per week managers spent thinking 
about new ideas. 


Country   Sample size   Sample mean   Sample SD
U.S.      174           5.8           6.0
Canada    353           5.1           4.6


Does it appear that true average time per week that U.S. managers spend thinking
about new ideas differs from that for Canadian managers? State and test the relevant
hypotheses.


21. Credit card spending and resulting debt
pose very real threats to consumers in 
general, and the potential for abuse is 
especially serious among college students. 
It has been estimated that about two-thirds 
of all college students possess credit cards, 
and 80% of these students received cards 
during their first year of college. The article 
“College Students’ Credit Card Debt and 
the Role of Parental Involvement: Impli- 
cations for Public Policy” (J. Public Policy 
Mark. 2001: 105-113) reported that for 209 
students whose parents had no involvement 
whatsoever in credit card acquisition or 
payments, the sample mean total account 
balance was $421 with a sample standard 
deviation of $686, whereas for 75 students 
whose parents assisted with payments even
though they were under no legal obligation
to do so, the sample mean and sample 
standard deviation were $666 and $1048, 
respectively. All sampled students were at 
most 21 years of age. 


a. Do you think it is plausible that the 
distributions of total debt for these two 
types of students are normal? Why or 
why not? Is it necessary to assume 
normality in order to compare the two 
groups using an inferential procedure 
described in this chapter? Explain. 

b. Estimate the true average difference 
between total balance for noninvolvement
students and postacquisition-involvement
students using a method
that incorporates precision into the
estimate. Then interpret the estimate.
[Note: Data was also reported in the
article for preacquisition involvement
only and for both pre- and postacquisition
involvement.]


22. Returning to the previous exercise, the
mean and standard deviation of the number 
of credit cards for the no-involvement 
group were 2.22 and 1.58, respectively, 
whereas the mean and standard deviation 
for the payment-help group were 2.09 and 
1.65, respectively. Does it appear that the 
true average number of cards for no- 
involvement students exceeds the average 
for payment-help students? Carry out an 
appropriate test of significance. 


23. Expert and amateur pianists were compared
in a study “Maintaining Excellence: 
Deliberate Practice and Elite Performance 
in Young and Older Pianists” (J. Exp. Psy- 
chol. Gen. 1996: 331-340). The researchers 
used a keyboard that allowed measurement 
of the force applied by a pianist in striking a 
key. All 48 pianists played Prelude Number 
1 from Bach’s Well-Tempered Clavier. For 
24 amateur pianists the mean force applied 
was 74.5 with standard deviation 6.29, and 
for 24 expert pianists the mean force was 
81.8 with standard deviation 8.64. Do 
expert pianists hit the keys harder?
Assuming normally distributed data, state 
and test the relevant hypotheses, and 
interpret the results. 


24. The article “Supervised Exercise Versus
Non-Supervised Exercise for Reducing 
Weight in Obese Adults” (J. Sport. Med. 
Phys. Fit. 2009: 85-90) reported on an 
investigation in which participants were 
randomly assigned either to a supervised 
exercise program or a control group. Those 
in the control group were told only that they 
should take measures to lose weight. After 
4 months, the sample mean decrease in 
body fat for the 17 individuals in the 
experimental group was 6.2 kg with a 
sample standard deviation of 4.5 kg, 
whereas the sample mean and sample 
standard deviation for the 17 people in the 
control group were 1.7 kg and 3.1 kg, 
respectively. Assume normality of the two 
body fat loss distributions (as did the 
investigators). 


a. Calculate a 99% lower prediction bound 
for the body fat loss of a single ran- 
domly selected individual subjected to 
the supervised exercise program. Can 
you be highly confident that such an 
individual will actually lose body fat? 

b. Does it appear that true average 
decrease in body fat is more than 2 kg 
larger for the experimental condition 
than for the control condition? Carry out 
a test of appropriate hypotheses using a 
significance level of .01. 


25. Fusible interlinings are being used with increasing frequency to support outer fabrics
and improve the shape and drape of various pieces of clothing. The article
“Compatibility of Outer and Fusible Interlining Fabrics in Tailored Garments” (Textile
Res. J. 1997: 137-142) gave the accompanying data on extensibility (%) at
100 g/cm for both high-quality fabric (H) and poor-quality fabric (P) specimens.


H:  1.2   .9   .7  1.0  1.7  1.7  1.1   .9  1.7
    1.9  1.3  2.1  1.6  1.8  1.4  1.3  1.9  1.6
     .8  2.0  1.7  1.6  2.3  2.0

P:  1.6  1.5  1.1  2.1  1.5  1.3  1.0  2.6


a. Construct normal probability plots to 
verify the plausibility of both samples 
having been selected from normal pop- 
ulation distributions. 

b. Construct a comparative boxplot. Does 
it suggest that there is a difference 
between true average extensibility for 
high-quality fabric specimens and that 
for poor-quality specimens? 

c. The sample mean and standard devia- 
tion for the high-quality sample are 
1.508 and .444, respectively, and those 
for the poor-quality sample are 1.588 
and .530. Use the two-sample ¢ test 
to decide whether true average extensi- 
bility differs for the two types of 
fabric. 


26. Imaging of the colon with a contrast dye to
evaluate for injury requires that the colon 
first be distended by pumping in carbon 
dioxide. The article “Determination of 
Normal Distribution of Distended Colon 
Volumes to Guide Performance of Colonic 
Imaging With Fluid Distention” (Curr. 
Probl. Diagn. Radiol. 2016: 185-188) 
reported that for a sample of 85 female 
patients undergoing this procedure, the 
mean colon length after distention was 
201.8 cm and the standard deviation was 
32.2 cm, whereas for a sample of 31 males 
the mean and standard deviation were 
180.2 cm and 38.6 cm, respectively. 


a. Carry out a test at significance level .1 
to decide whether true average length 
differs by sex (the article reported a 
P-value for this test). 

b. Construct and interpret a 90% CI for the
difference in true average colon length 
between the two sexes under these 
settings. Is your interval consistent 
with the result of the test in part (a)? 
Explain. 


27. Research has shown that good hip range of
motion and strength in throwing athletes 
results in improved performance and 
decreased body stress. The article “Func- 
tional Hip Characteristics of Baseball 
Pitchers and Position Players” (Am. 
J. Sport. Med. 2010: 383-388) reported on 
a study involving samples of 40 profes- 
sional pitchers and 40 professional position 
players. For the pitchers, the sample mean 
trail leg total arc of motion (degrees) was 
75.6 with a sample standard deviation of 
5.9, whereas the sample mean and sample 
standard deviation for position players were 
79.6 and 7.6, respectively. Assuming nor- 
mality, test appropriate hypotheses to 
decide whether true average range of 
motion for the pitchers is less than that for 
the position players (as hypothesized by the 
investigators). In reaching your conclusion, 
what type of error might you have 
committed? 


28. Tennis elbow is thought to be aggravated 
by the impact experienced when hitting the 
ball. The article “Forces on the Hand in the 
Tennis One-Handed Backhand” (Int.
J. Sport Biomech. 1991: 282-292) reported
the force (Newtons) on the hand just after 
impact on a one-handed backhand drive for 
six advanced players and for eight inter- 
mediate players. 


Type of player     Sample size   Sample mean   Sample SD
1. Advanced        6             40.3          11.3
2. Intermediate    8             21.4          8.3


In their analysis of the data, the authors 
assumed that both force distributions were 
normal. Calculate a 95% CI for the differ- 
ence between true average force for 
advanced players (μ₁) and true average
force for intermediate players (μ₂). Does
your interval provide compelling evidence
for concluding that the two μ’s are different?
Would you have reached the same
conclusion by calculating a CI for μ₂ − μ₁
(i.e., by reversing the 1 and 2 labels on the
two types of players)? Explain. 


29. As the population ages, there is increasing 
concern about accident-related injuries to 
the elderly. The article “Age and Gender 
Differences in Single-Step Recovery from a 
Forward Fall” (J. Gerontol A Biol. Sci. 
Med. Sci. 1999 54(1):M44—50) reported on 
an experiment in which the maximum lean 
angle—the farthest a subject is able to lean 
and still recover in one step—was deter- 
mined for both a sample of younger 
females (21-29 years) and a sample of 
older females (67-81 years). The following 
observations are consistent with summary 
data given in the article: 


Younger: 29, 34, 33, 27, 28, 32, 31, 34, 32, 27
Older: 18, 15, 23, 13, 12

Does the data suggest that true average
maximum lean angle for older females is
more than 10 degrees smaller than it is for
younger females? State and test the relevant
hypotheses at significance level .10 by
obtaining a P-value. 


30. The article “Effect of Internal Gas Pressure 
on the Compression Strength of Beverage 
Cans and Plastic Bottles” (J. Test. Eval. 
1993: 129-131) includes the accompanying 
data on compression strength (lb) for a 
sample of 12-oz aluminum cans filled with
strawberry drink and another sample filled 
with cola. Does the data suggest that the 
extra carbonation of cola results in a higher 
average compression strength? Base your 
answer on a P-value. What assumptions are 
necessary for your analysis? 


Beverage           Sample size   Sample mean   Sample SD
Strawberry drink   15            540           21
Cola               15            554           15


31. Which foams more when you pour it, Coke or Pepsi? Here are measurements by Diane
Warfield on the foam volume (mL) after pouring a 12-oz can of Coke, based on a
sample of 12 cans:


312.2 292.6 331.7 355.1 362.9 331.7
292.6 245.8 280.9 320.0 273.1 288.7


and here are measurements for Pepsi, based on a sample of 12 cans:


148.3 210.7 152.2 117.1 89.7 140.5
128.8 167.8 156.1 136.6 124.9 136.6


a. Verify graphically that normality is an appropriate assumption.

b. Calculate a 99% confidence interval for the population difference in mean
volumes.

c. Does the upper limit of your interval in (b) give a 99% lower confidence bound
for the difference between the two μ’s? If not, calculate such a bound and interpret
it in terms of the relationship between the foam volumes of Coke and Pepsi.

d. Summarize in a sentence what you have learned about the foam volumes of
Coke and Pepsi.


32. In a comparative study conducted at Virginia Tech, two Principles of Economics
classes were run in an identical fashion except for one respect: one class used an
interactive electronic teaching system (called WITS) for seven “research exercises,”
while the other class discussed the research exercises but did not use the
interactive devices. Final exam score results are summarized below (“Technology
Improves Learning in Large Principles of Economics Classes: Using Our WITS,”
Am. Econ. Rev. 2004: 442-446).


              Sample size   Sample mean   Sample SD
WITS          62            77.45         11.1
Traditional   64            74.25         8.7


a. Test to see whether the true mean final exam scores using WITS and using
traditional instruction are different, at the α = .10 significance level.

b. Construct a 90% CI for the difference in
true mean final exam score for WITS 
instruction and traditional instruction. Is 
your interval consistent with the test in 
part (a)? 

c. What does the interval in part (b) say 
about the practical significance of the 
test? 


33. The article “Characterization of Bearing
Strength Factors in Pegged Timber Con- 
nections” (J. Struct. Engr. 1997: 326-332) 
gave the following summary data on pro- 
portional stress limits for specimens 
constructed using two different types of 
wood: 


Type of wood   Sample size   Sample mean   Sample SD
Red oak        14            8.48          .79
Douglas fir    10            6.65          1.28


Assuming that both samples were selected
from normal distributions, carry out a test 
of hypotheses to decide whether the true 
average proportional stress limit for red oak 
joints exceeds that for Douglas fir joints by 
more than 1 MPa. 


34. According to the article “Fatigue Testing of
Condoms” (Polym. Test. 2009: 567-571), 
“tests currently used for condoms are sur- 
rogates for the challenges they face in use,” 
including a test for holes, an inflation test, a 
package seal test, and tests of dimensions 
and lubricant quality (all fertile territory for 
the use of statistical methodology!). The 
investigators developed a new test that adds 
cyclic strain to a level well below breakage 
and determines the number of cycles to 
break. The cited article reported that for a 
sample of 20 natural latex condoms of a 
certain type, the sample mean and sample 
standard deviation of the number of cycles 
to break were 4358 and 2218, respectively, 
whereas a sample of 20 polyisoprene con- 
doms gave a sample mean and sample 
standard deviation of 5805 and 3990, 
respectively. Is there strong evidence for 
concluding that the true average number of 
cycles to break for the polyisoprene condom
exceeds that for the natural latex
condom by more than 1000 cycles? [Note:
The article presented the results of
hypothesis tests based on the t distribution;
the validity of these depends on assuming
normal population distributions.]


35. Exercise 22 from Chapter 9 gave the following
data on amount (oz) of alcohol poured
into a short, wide tumbler glass by a sample of 
experienced bartenders: 2.00, 1.78, 2.16, 
1.91, 1.70, 1.67, 1.83, 1.48. The cited article 
also gave summary data on the amount 
poured by a different sample of experienced 
bartenders into a tall, slender (highball) glass; 
the following observations are consistent with 
the reported summary data: 1.67, 1.57, 1.64, 
1.69, 1.74, 1.75, 1.70, 1.60. 


a. What does a comparative boxplot sug- 
gest about similarities and differences in 
the data? 

b. Carry out a test of hypotheses to decide 
whether the true average amount poured 
is different for the two types of glasses; 
be sure to check the validity of any 
assumptions necessary to your analysis, 
and report a P-value. 


36. Is the incidence of head or neck pain among 
video display terminal users related to the 
monitor angle (degrees from horizontal)? 
The paper, “An Analysis of VDT Monitor 
Placement and Daily Hours of Use for 
Female Bifocal Users” (Work 2003: 77-80), 
reported the accompanying data. Carry out 
an appropriate test of hypotheses (be sure to 
include a P-value in your analysis). 


Pain Sample size Sample mean Sample SD 
Yes 32 2.20 3.42 
No 40 3.20 2.52 


37. The article “Gender Differences in Indi- 
viduals with Comorbid Alcohol Depen- 
dence and Post-Traumatic Stress Disorder” 
(Am. J. Addict. 2003: 412-423) reported the
accompanying data on total score on the
Obsessive-Compulsive Drinking Scale (OCSD).


Gender   Sample size   Sample mean   Sample SD
Male     44            19.93         7.74
Female   40            16.26         7.58


Formulate hypotheses and carry out an 
appropriate analysis. Does your conclusion 
depend on whether a significance level of 
.05 or .01 was employed? (The cited paper 
reported P-value < .05; presumably .05 
would have been replaced by .01 if the 
P-value were really that small). 


38. Which factors are relevant to the time a 
consumer spends looking at a product on 
the shelf prior to selection? The article 
“Effects of Base Price upon Search 
Behavior of Consumers in a Supermarket” 
(J. Econ. Psychol. 2003: 637-652) reported 
the following data on elapsed time (sec) for 
fabric softener purchasers and washing-up 
liquid purchasers; the former product is 
significantly more expensive than the latter. 
These products were chosen because they 
are similar with respect to allocated shelf 
space and number of alternative brands. 


Product             Sample size   Sample mean   Sample SD
Fabric softener     15            30.47         19.15
Washing-up liquid   19            26.53         15.37


a. What if any assumptions are needed 
before an inferential procedure can be 
used to compare true average elapsed 
times? 

b. If just the two sample means had been 
reported, would they provide persuasive 
evidence for a significant difference 
between true average elapsed times for 
the two products? 

c. Carry out an appropriate test of significance
and state your conclusion.


39. Let X₁, …, Xₘ ~ N(μ₁, σ) and Y₁, …, Yₙ ~ N(μ₂, σ) be independent random
samples from the specified normal population distributions (note that the population
sd’s are equal). Let S₁² and S₂² denote the sample variances of the two samples, and
define a pooled variance estimator of σ² by

$$S_p^2 = \frac{m-1}{m+n-2}\,S_1^2 + \frac{n-1}{m+n-2}\,S_2^2$$

Show that (m + n − 2)Sₚ²/σ² has a chi-squared distribution with m + n − 2 df.
[Hint: Recall from Chapter 6 that (m − 1)S₁²/σ² ~ χ²ₘ₋₁ and similarly for the
second sample variance. What is the distribution of the sum of two independent
chi-squared rvs?]


40. Refer back to the scenario of the previous exercise.


a. Verify that the standardized variable

$$\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2(1/m + 1/n)}}$$

has a standard normal distribution.

b. Show that the pooled t variable

$$T_p = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{S_p^2(1/m + 1/n)}}$$

has a t distribution with m + n − 2 df. [Hint: create a t-distributed variable
using the standard normal rv from part (a) and the chi-squared rv from the
previous exercise.]


41. Consider the pooled t variable Tₚ from part (b) of the previous exercise.


a. Use this t variable to obtain a pooled t confidence interval formula for μ₁ − μ₂.

b. The article “Effect of Welding on a High-Density Polyethylene Liner”
(J. Mater. Civil Engr. 1996: 94-100) reported the following data on tensile
strength (psi) of liner specimens both when a certain fusion process was used
and when this process was not used.


No fusion   2748 2700 2655 2822 2511 3213 3220 2753
            3149 3257
Fused       3027 3356 3359 3297 3125 2910 2889 2902


Use the pooled t formula from part (a) to estimate the difference between true
average tensile strength for the two processes with a 95% confidence interval.

c. Estimate the difference between the two μ’s using the two-sample t interval
discussed in this section, and compare it to the interval of part (b).


Refer to the previous two exercises. Describe the pooled t test for testing $H_0$: $\mu_1 - \mu_2 = \Delta_0$ when both population distributions are normal with $\sigma_1 = \sigma_2$. Then use this test procedure to test the hypotheses in Example 10.7.
10.3 Analysis of Paired Data

In Sections 10.1 and 10.2, we considered estimating or testing for a difference between two means $\mu_1$ and $\mu_2$. This was done by utilizing the results of a random sample $X_1, \ldots, X_m$ from the distribution with mean $\mu_1$ and a completely independent (of the X’s) sample $Y_1, \ldots, Y_n$ from the distribution with mean $\mu_2$. That is, either m individuals were selected from population 1 and n different individuals from population 2, or m individuals were given one treatment and another n individuals were given the other treatment. In contrast, there are a number of experimental situations in which there is only one set of n individuals or experimental objects, and two observations are made on each individual or object, resulting in a natural pairing of values.


Example 10.9 Homes are typically appraised before sale. Appraisers hired by lenders such as banks 
have an incentive to assign a higher value to a house (so the home loan will be larger), while 
borrowers’ appraisers might be inclined to value the same house at a lower price. The article 
“Distressed Properties: Valuation Bias and Accuracy” (J. Real Estate Fin. Econ. 2010) describes a 
study in which a random sample of 20 residential properties being purchased in New Orleans after 
foreclosure was selected. Each property was appraised both by the borrower and by the lender, 
resulting in the following data (thousands of dollars). 


House                     1      2      3      4      5      6      7      8
Lender’s appraisal      24.3   31.1  108.5   20.0   58.2   23.6   38.7   54.2
Borrower’s appraisal    18.6   21.8   98.1   10.2   50.2   15.7   29.8   45.5

House                     9     10     11     12     13     14     15     16
Lender’s appraisal      21.3  145.3  123.4  171.0   41.2  123.1   47.4   26.1
Borrower’s appraisal    14.6  135.8  111.4  156.5   31.2  109.7   39.7   18.6

House                    17     18     19     20
Lender’s appraisal      76.9   52.5  101.2   33.6
Borrower’s appraisal    67.5   42.2   90.0   26.4


Figure 10.4 displays a plot of this data. At first glance, it appears that lenders’ appraisals are 
perhaps a little higher on average than borrowers’, but there is a great deal of variability in both 
samples. So, perhaps any differences between the samples can be attributed to this variability. 


[Dotplots of the Lender and Borrower samples; horizontal axis: house appraisal (thousands of dollars)]

Figure 10.4 Plot of original data from Example 10.9


However, looking back at the original data, a clearer picture emerges: for every single house, the 
lender’s appraisal exceeds the borrower’s appraisal. Figure 10.5 displays the difference in appraised 
value (lender’s appraisal minus borrower’s appraisal) for these 20 homes. As we will see, a correct 
analysis of this data focuses on these differences. 
[Dotplot; horizontal axis: difference in appraisals (lender minus borrower, thousands of dollars)]

Figure 10.5 Plot of differences from Example 10.9


ASSUMPTIONS   The data consists of n independently selected pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, with $E(X_i) = \mu_1$ and $E(Y_i) = \mu_2$. Let $D_1 = X_1 - Y_1, \ldots, D_n = X_n - Y_n$, so the $D_i$’s are the differences within pairs. Then the $D_i$’s are assumed to be normally distributed with mean value $\mu_D$ and standard deviation $\sigma_D$.


We are again interested in hypothesis testing or estimation for the difference $\mu_1 - \mu_2$. The denominator of the two-sample t statistic was obtained by first applying the rule $V(\bar{X} - \bar{Y}) = V(\bar{X}) + V(\bar{Y})$. However, with paired data, the X and Y observations within each pair are often not independent, so $\bar{X}$ and $\bar{Y}$ are not independent of each other, and the rule is not valid. We must therefore abandon the two-sample t procedures and look for an alternative method of analysis.


A Confidence Interval for $\mu_D$

Because different pairs are independent, the $D_i$’s are independent of each other. If we let $D = X - Y$, where X and Y are the first and second observations, respectively, within a randomly selected pair, then the expected difference is

\[ \mu_D = E(X - Y) = E(X) - E(Y) = \mu_1 - \mu_2 \]

(recall that linearity of expectation is valid even when X and Y are dependent). Thus a confidence interval for $\mu_D$ is equivalent to one for $\mu_1 - \mu_2$. An analogous comment applies to a test of hypotheses. But since the $D_i$’s constitute a normal random sample (of differences) with mean $\mu_D$, inferences about $\mu_D$ can be performed using one-sample t procedures from Chapters 8 and 9. That is, to draw conclusions about $\mu_1 - \mu_2$ when data is paired, form the differences $D_1, D_2, \ldots, D_n$ and carry out a one-sample t procedure, based on $n - 1$ df, on the $D_i$’s.

Let $\bar{D}$ and $S_D$ denote the sample mean and standard deviation, respectively, of the n paired differences $D_1, \ldots, D_n$. In the same way that the t CI for a single population mean $\mu$ is based on the t variable $T = (\bar{X} - \mu)/(S/\sqrt{n})$, a t confidence interval for $\mu_D$ ($= \mu_1 - \mu_2$) is based on the fact that

\[ T = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \qquad (10.4) \]

has a t distribution with $n - 1$ df. Manipulation of this t variable, as in previous derivations of CIs, yields the following interval.
PAIRED t INTERVAL

The paired t CI for $\mu_D$ with confidence level $100(1 - \alpha)\%$ has endpoints

\[ \bar{d} \pm t_{\alpha/2,\,n-1} \cdot \frac{s_D}{\sqrt{n}} \]

where $\bar{d}$ and $s_D$ are the observed values of the sample mean and standard deviation of the paired differences. A one-sided confidence bound results from retaining the relevant sign (+ or −) and replacing $t_{\alpha/2}$ by $t_\alpha$.


When n is small, the validity of this interval requires that the distribution of differences be at least
approximately normal. For large n, the CLT ensures that the interval is at least approximately valid 
without any restrictions on the distribution of differences. 


Example 10.10 Adding computerized medical images to a database promises to provide great 
resources for physicians. However, there are other methods of obtaining such information, so the 
issue of efficiency of access needs to be investigated. The article “The Comparative Effectiveness of 
Conventional and Digital Image Libraries” (J. Audio Media Med. 2001: 8-15) reported on an 
experiment in which 13 computer-proficient medical professionals were timed both while retrieving 
an image from a library of slides and while retrieving the same image from a computer database with 
a Web front end. 


Subject       1   2   3   4   5   6   7   8   9  10  11  12  13
Slide        30  35  40  25  20  30  35  62  40  51  25  42  33
Digital      25  16  15  15  10  20   7  16  15  13  11  19  19
Difference    5  19  25  10  10  10  28  46  25  38  14  23  14


Let $\mu_D$ denote the true mean difference between slide retrieval time (sec) and digital retrieval time. Using the paired t confidence interval to estimate $\mu_D$ requires that the difference distribution be at least approximately normal. The slight curvature in the normal probability plot from JMP (Figure 10.6) isn’t enough to invalidate the normality assumption.
Figure 10.6 Normal probability plot of the differences in Example 10.10

For the 13 differences, $\bar{d} = 20.5$ and $s_D = 11.96$. The t critical value required for a 95% confidence level is $t_{.025,12} = 2.179$, and the 95% CI is

\[ \bar{d} \pm t_{\alpha/2,\,n-1} \cdot \frac{s_D}{\sqrt{n}} = 20.5 \pm 2.179 \cdot \frac{11.96}{\sqrt{13}} = 20.5 \pm 7.2 = (13.3, 27.7) \]

Thus we can be highly confident (at the 95% confidence level) that $13.3 < \mu_D < 27.7$. This interval of plausible values is rather wide, a consequence of the sample standard deviation being large relative to the sample mean. A sample size much larger than 13 would be required to estimate $\mu_D$ with substantially more precision. Notice, however, that 0 lies well outside the interval, suggesting that $\mu_D > 0$; this is confirmed by a formal hypothesis test. We can conclude from the experiment that computer retrieval appears to be faster on average. ■
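The interval in Example 10.10 can be reproduced with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name `paired_t_interval` is our own, the critical value is passed in from the t table rather than computed, and the digital time for subject 7 is taken to be 7, the value implied by the slide time (35) and the tabulated difference (28).

```python
import math
import statistics


def paired_t_interval(x, y, t_crit):
    """Return (lower, upper) endpoints d_bar -/+ t_crit * s_D / sqrt(n)."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [xi - yi for xi, yi in zip(x, y)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample sd of the differences (divisor n - 1)
    half_width = t_crit * s_d / math.sqrt(n)
    return d_bar - half_width, d_bar + half_width


# Retrieval times (sec) from Example 10.10
slide = [30, 35, 40, 25, 20, 30, 35, 62, 40, 51, 25, 42, 33]
digital = [25, 16, 15, 15, 10, 20, 7, 16, 15, 13, 11, 19, 19]

# t_{.025,12} = 2.179 is read from the t table, as in the example
lower, upper = paired_t_interval(slide, digital, t_crit=2.179)
print(f"95% CI for mu_D: ({lower:.2f}, {upper:.2f})")
```

Because the code carries unrounded values of the mean and standard deviation through the calculation, its endpoints can differ from the hand computation in the last reported digit.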


The Paired t Test 

Hypothesis testing for paired data also involves calculating the n paired differences and working with that single sample of values. The T variable in (10.4) forms the basis for such tests. And since $\mu_D = \mu_1 - \mu_2$, any hypothesis about the mean difference is equivalent to a hypothesis about the difference between means.


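PAIRED t TEST

Null hypothesis: $H_0$: $\mu_D = \Delta_0$

Test statistic value: $t = \dfrac{\bar{d} - \Delta_0}{s_D/\sqrt{n}}$

Alternative Hypothesis          Rejection Region for Level $\alpha$ Test
$H_a$: $\mu_D > \Delta_0$       $t \ge t_{\alpha,n-1}$
$H_a$: $\mu_D < \Delta_0$       $t \le -t_{\alpha,n-1}$
$H_a$: $\mu_D \ne \Delta_0$     either $t \ge t_{\alpha/2,n-1}$ or $t \le -t_{\alpha/2,n-1}$

A P-value can be calculated as was done for earlier t tests.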


Example 10.11 Musculoskeletal neck-and-shoulder disorders are all too common among office staff 
who perform repetitive tasks using visual display units. The article “Upper-Arm Elevation During 
Office Work” (Ergonomics 1996: 1221-1230) reported on a study to determine whether more varied 
work conditions would have any impact on arm movement. The accompanying data was obtained 
from a sample of n = 16 subjects. Each observation is the amount of time, expressed as a proportion 
of total time observed, during which arm elevation was below 30°. The two measurements from each 
subject were obtained 18 months apart. During this period, work conditions were changed, and 
subjects were allowed to engage in a wider variety of work tasks. Does the data suggest that true 
average time during which elevation is below 30° differs after the change from what it was before the 
change? This particular angle is important because in Sweden, where the research was conducted, 
workers’ compensation regulations assert that arm elevation less than 30° is not harmful. 
Subject        1    2    3    4    5    6    7    8
Before        81   87   ..   82   ..   86   96   ..
After         78   91   ..   78   ..   67   92   ..
Difference     3   −4   ..    4   ..   19    4   ..

Subject        9   10   11   12   13   14   15   16
Before        74   75   ..   80   ..   72   56   ..
After         58   62   ..   58   ..   60   65   ..
Difference    16   13   ..   22   ..   12   −9   ..

Figure 10.7 shows a normal probability plot of the 16 differences; the pattern in the plot is quite straight, supporting the normality assumption. A boxplot of these differences appears in Figure 10.8; the box is located considerably to the right of zero, suggesting that perhaps $\mu_D > 0$ (note also that 13 of the 16 differences are positive and only two are negative).


Figure 10.7. A normal probability plot from Minitab of the differences in Example 10.11 
Figure 10.8 A boxplot of the differences in Example 10.11

Let’s now use the recommended sequence of steps (refer back to Example 9.20) to test the appropriate hypotheses using the P-value method.


1. Let $\mu_D$ denote the true average difference between elevation time before the change in work conditions and time after the change.

2. $H_0$: $\mu_D = 0$ (there is no difference between true average time before the change and true average time after the change)
   $H_a$: $\mu_D \ne 0$ (there is a difference)

3. The paired t test requires data from a normal population. Figure 10.7 validates the plausibility of this assumption. The test statistic value is

\[ t = \frac{\bar{d} - 0}{s_D/\sqrt{n}} = \frac{\bar{d}}{s_D/\sqrt{n}} \]

4. From the n = 16 differences, $\bar{d} = 6.75$ and $s_D = 8.234$, so

\[ t = \frac{6.75}{8.234/\sqrt{16}} = 3.28 \approx 3.3 \]

5. Appendix Table A.7 shows that the area to the right of 3.3 under the t curve with 15 df is .002. The inequality in $H_a$ implies that a two-tailed test is appropriate, so the P-value is approximately 2(.002) = .004 (software gives .0051).

6. Since .004 < .01, the null hypothesis can be rejected at either significance level .05 or .01. It does appear that the true average difference between times is something other than zero; that is, true average time after the change is different from that before the change. Recalling that arm elevation should be kept under 30°, we can conclude that the situation became worse because the amount of time below 30° decreased. ■
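The arithmetic in steps 4 and 5 can be checked from the summary statistics alone. The sketch below is one possible way to do so using only the Python standard library; the helper names are our own, and the P-value is obtained by integrating the t density numerically with Simpson's rule, which is a substitute for the table lookup used in the example rather than the method of the text.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_central_area(t, df, steps=2000):
    """Area under the t curve between -|t| and |t| (composite Simpson's rule)."""
    b = abs(t)
    if b == 0:
        return 0.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # doubled: the density is symmetric about 0


def paired_t_test(d_bar, s_d, n, delta0=0.0):
    """Return (t, two-tailed P-value) for H0: mu_D = delta0."""
    t = (d_bar - delta0) / (s_d / math.sqrt(n))
    return t, 1 - t_central_area(t, n - 1)


# Summary statistics from Example 10.11
t_stat, p_value = paired_t_test(d_bar=6.75, s_d=8.234, n=16)
print(f"t = {t_stat:.2f}, two-tailed P-value = {p_value:.4f}")
```

For a one-sided alternative, half of the two-tailed value applies when the sign of t agrees with the direction of $H_a$.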


Paired t Versus Two-Sample t Procedures 
Consider using the two-sample t test on paired data. The numerators of the paired t and two-sample t test statistics are identical, since

\[ \bar{d} = \frac{1}{n}\sum d_i = \frac{1}{n}\sum (x_i - y_i) = \frac{1}{n}\sum x_i - \frac{1}{n}\sum y_i = \bar{x} - \bar{y} \]

The difference between the two test statistics is due entirely to the denominators. Each test statistic is obtained by standardizing $\bar{X} - \bar{Y}$ ($= \bar{D}$), but in the presence of dependence the two-sample t standardization is incorrect. To see this, recall from Section 5.3 that

\[ V(X \pm Y) = V(X) + V(Y) \pm 2\,\mathrm{Cov}(X, Y) \]

Since the correlation between X and Y is $\rho = \mathrm{Corr}(X, Y) = \mathrm{Cov}(X, Y)/[\sqrt{V(X)} \cdot \sqrt{V(Y)}]$, it follows that

\[ V(X - Y) = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2 \]

Applying this to $\bar{X} - \bar{Y}$ yields

\[ V(\bar{X} - \bar{Y}) = V(\bar{D}) = V\!\left(\frac{1}{n}\sum D_i\right) = \frac{V(D_i)}{n} = \frac{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}{n} \]


The two-sample t test is based on the assumption of independence, in which case $\rho = 0$. But in many paired experiments, there will be a strong positive dependence between X and Y (large X associated with large Y), so that $\rho$ will be positive and the variance of $\bar{X} - \bar{Y}$ will be smaller than $\sigma_1^2/n + \sigma_2^2/n$. Thus whenever there is positive dependence within pairs, the denominator for the paired t statistic should be smaller than for t of the independent-samples test, resulting in a larger test statistic and a smaller P-value.

Similarly, when data is paired, the paired t CI will usually be narrower than the (incorrect) two-sample t CI. This is because there is typically much less variability in the differences than in the x and y values.
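The appraisal data of Example 10.9 illustrate how large the gap between the two denominators can be. The following sketch computes both test statistics for $H_0$: $\mu_D = 0$, along with the sample correlation within pairs; the variable names are our own, and the two-sample version shown is the unpooled one, used here only to display the (incorrect) standardization.

```python
import math
import statistics

# Appraisals (thousands of dollars) for the 20 houses of Example 10.9
lender = [24.3, 31.1, 108.5, 20.0, 58.2, 23.6, 38.7, 54.2, 21.3, 145.3,
          123.4, 171.0, 41.2, 123.1, 47.4, 26.1, 76.9, 52.5, 101.2, 33.6]
borrower = [18.6, 21.8, 98.1, 10.2, 50.2, 15.7, 29.8, 45.5, 14.6, 135.8,
            111.4, 156.5, 31.2, 109.7, 39.7, 18.6, 67.5, 42.2, 90.0, 26.4]

n = len(lender)
diffs = [x - y for x, y in zip(lender, borrower)]
d_bar = statistics.mean(diffs)  # equals x_bar - y_bar

# Paired standardization: sd of the differences
se_paired = statistics.stdev(diffs) / math.sqrt(n)
# Two-sample standardization: treats the samples as independent (rho = 0)
se_two_sample = math.sqrt(statistics.variance(lender) / n
                          + statistics.variance(borrower) / n)

t_paired = d_bar / se_paired
t_two_sample = d_bar / se_two_sample

# Sample correlation within pairs, computed from its definition
x_bar, y_bar = statistics.mean(lender), statistics.mean(borrower)
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(lender, borrower)) / (n - 1)
rho_hat = cov / (statistics.stdev(lender) * statistics.stdev(borrower))

print(f"paired t = {t_paired:.2f}, two-sample t = {t_two_sample:.2f}, "
      f"sample correlation = {rho_hat:.4f}")
```

The numerators agree, so the very different values of the two statistics come entirely from the standard errors, as the variance formula above predicts when the correlation is close to 1.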

The paired t and two-sample t procedures described in this section and the previous section apply when we want to compare two populations, treatments, or conditions based upon a quantitative measurement (profit, sales, time, etc.). Many situations exist in which researchers can design their study using their choice of either “matched pairs” or two independent samples. However, once that design decision is made, only one analysis procedure is correct. In Examples 10.9-10.11, it would be wrong to use the two-sample t procedures from Section 10.2, since they are predicated on having two independent random samples of data. Similarly, even when investigators gather two independent samples of the same size (m = n), the paired t procedures would not be appropriate because no natural pairing would exist between the individuals in sample #1 and the unrelated individuals in sample #2.

Sometimes, as in our examples, paired data results from two observations being taken on the same 
individual or object. Even when this cannot be done, paired data with dependence within pairs can be 
obtained by matching on one or more characteristics thought to influence responses. For example, in a 
pharmaceutical study to compare the efficacy of two drugs for lowering blood pressure, the experi- 
menter’s budget might allow for the treatment of 100 patients. If 50 patients are randomly selected for 
treatment with the first drug and another 50 independently selected for treatment with the second 
drug, an independent-samples experiment results. 

However, the experimenter, knowing that blood pressure is influenced by age and weight, might 
decide to create pairs of patients so that within each of the resulting 50 pairs, age and weight were 
approximately equal (though there might be sizable differences between pairs). Then the two drugs 
would be randomly assigned to the subjects within each pair, for a total of 50 observations on each 
drug. The benefit of this matching (or “blocking”) is that we can account for unwanted sources of
variation (e.g., age and weight) that might otherwise have masked differences in the two treatments. 
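The matching-then-randomizing design described above can be sketched in code. Everything in this example is hypothetical: the patient records are generated at random, the function name is our own, and the matching rule (sort by age, then weight, and pair neighbors) is a deliberately crude stand-in for whatever matching procedure an experimenter would actually use.

```python
import random


def assign_within_pairs(patients, seed=0):
    """Pair similar patients, then randomize the two drugs within each pair.

    patients: list of (patient_id, age, weight) tuples, even in number.
    Returns a list of (id_given_drug_1, id_given_drug_2) tuples.
    """
    if len(patients) % 2 != 0:
        raise ValueError("need an even number of patients")
    rng = random.Random(seed)
    # Crude matching (an assumption of this sketch): sort by age, then weight,
    # and pair neighbors in the sorted order.
    ordered = sorted(patients, key=lambda p: (p[1], p[2]))
    pairs = []
    for i in range(0, len(ordered), 2):
        pair = [ordered[i][0], ordered[i + 1][0]]
        rng.shuffle(pair)  # random assignment of drugs within the pair
        pairs.append((pair[0], pair[1]))
    return pairs


# Hypothetical patients: (id, age in years, weight in kg)
gen = random.Random(1)
patients = [(i, gen.randint(40, 75), gen.randint(55, 110)) for i in range(100)]
pairs = assign_within_pairs(patients)
print(len(pairs), "pairs; first pair (drug 1, drug 2):", pairs[0])
```

The responses from such a design would then be analyzed with the paired t procedures, using the 50 within-pair differences.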


Exercises: Section 10.3 (43-55) 


43. The Weaver—Dunn procedure with a fiber 
mesh tape augmentation is commonly used 
to treat AC joint (a joint in the shoulder) 
separations requiring surgery. The article 
“TightRope Versus Fiber Mesh Tape 
Augmentation of Acromioclavicular Joint 
Reconstruction” (Am. J. Sport Med. 2010: 
1204-1208) described the investigation of a 
new method which was hypothesized to 
provide superior stability (less movement) 
compared to the W-D procedure. The 


authors of the cited article kindly provided 
the accompanying data on anteposterior 
(forward—backward) movement (mm) for 
six matched pairs of shoulders: 


Subject        1    2    3    4    5    6
Fiber mesh    20   30   20   32   35   33
TightRope     15   18   16   19   10   12


Carry out a test of hypotheses at signifi- 
cance level .01 to see if true average 
movement for the TightRope treatment is 
indeed less than that for the Fiber Mesh
treatment. Be sure to check any assump- 
tions underlying your analysis. 


44. Hexavalent chromium has been identified as an inhalation carcinogen and an air toxin of concern in a number of different locales. The article “Airborne Hexavalent Chromium in Southwestern Ontario” (J. Air Waste Manage. 1997: 905-910) gave the accompanying data on both indoor and outdoor concentration (nanograms/m^3) for a sample of houses selected from a certain region.

House       1     2     3     4     5     6     7     8     9
Indoor     .07   .08   .09   .12   .12   .12   .13   .14   .15
Outdoor    .29   .68   .47   .54   .97   .35   .49   .84   .86

House      10    11    12    13    14    15    16    17
Indoor     ..    .17   .17   .18   .18   .18   .18   .19
Outdoor    .28   .32   ..    ..    ..    ..    ..    ..

House      18    19    20    21    22    23    24    25
Indoor     .20   .22   .22   .23   .23   .25   .26   .28
Outdoor   1.59   .90   .52   .12   .54   .88   .49  1.24

House      26    27    28    29    30    31    32    33
Indoor     .28   .29   .34   .39   .40   .45   .54   .62
Outdoor    .48   .27   .37  1.26   .70   .76   .99   .36

a. Calculate a confidence interval for the population mean difference between indoor and outdoor concentrations using a confidence level of 95%, and interpret the resulting interval.

b. If a 34th house was to be randomly selected from the population, between what values would you predict the difference in concentrations to lie?


45. Shoveling is not exactly a high-tech activity, but will continue to be a required task
even in our information age. The article “A 
Shovel with a Perforated Blade Reduces 
Energy Expenditure Required for Digging 
Wet Clay” (Hum. Factors 2010: 492-502) 
reported on an experiment in which each of 
13 workers was provided with both a con- 
ventional shovel and a shovel whose blade 
was perforated with small holes. The
authors of the cited article provided the 
following data on stable energy expenditure 
[kcal/kg(subject)/Ib(clay)]: 


Worker            1      2      3      4      5      6      7
Conventional    .0011  .0014  .0018  .0022  .0010  .0016  .0028
Perforated      .0011  .0010  .0019  .0013  .0011  .0017  .0024

Worker            8      9      10     11     12     13
Conventional    .0020  .0015  .0014  .0023  .0017  .0020
Perforated      .0020  .0013  .0013  .0017  .0015  .0013


a. Calculate a confidence interval at the 
95% confidence level for the true aver- 
age difference between energy expen- 
diture for the conventional shovel and 
the perforated shovel (a normal proba- 
bility plot of the sample differences 
shows a reasonably linear pattern). 
Based on this interval, does it appear 
that the shovels differ with respect to 
true average energy expenditure? 
Explain. 

b. Carry out a test of hypotheses at sig- 
nificance level .05 to see if true average 
energy expenditure using the conven- 
tional shovel exceeds that using the 
perforated shovel; include a P-value in 
your analysis. 


46. The article “Effect of Wearable Technology 


Combined With a Lifestyle Intervention on 
Long-term Weight Loss” (JAMA 2017: 
1161-1171) describes a study in which 
adults in a large cohort were provided the 
same weight-loss regimen for six months. 
Then, participants were randomly assigned 
to either (1) self-monitor diet and physical 
activity using a Web site or (2) track diet 
and physical activity with a wearable 
device and accompanying Web interface. 
The weights of all subjects (in kg) were 
recorded at the beginning of the study and 
24 months later. The following summary is 
consistent with information in the article. 


Treatment                Baseline   24 Months   Difference
(1)  n = 233   Mean:       95.2       89.3         5.9
               SD:         16.4       15.1         6.9
(2)  n = 237   Mean:       96.3       92.8         3.5
               SD:         16.5       15.7          13


a. Construct and interpret a 95% confidence interval for the population mean weight loss under the first treatment (self-monitoring with a Web site).

b. Construct and interpret a 95% confidence interval for the population mean weight loss under the second treatment (tracking with a wearable device).

c. How confident can you be that both the intervals in (a) and (b) contain the values of population mean weight loss?

d. Does the data show at the .05 level that the two population means in parts (a) and (b) are different? Perform the appropriate hypothesis test. [Note: This does not require a paired t procedure!]


47. Refer back to Example 10.9. Here are the differences in appraisals displayed in Figure 10.5:

 5.7   9.3  10.4   9.8   8.0   7.9   8.9   8.7   6.7   9.5
12.0  14.5  10.0  13.4   7.7   7.5   9.4  10.3  11.2   7.2


a. Construct a normal probability plot of 
these 20 differences. Is it plausible that 
the population distribution of differ- 
ences is normal? 

b. Construct and interpret a 95% upper 
confidence bound for the true mean 
difference in appraised home value. 

c. Test the hypothesis that the true mean 
difference in appraised home value is 
less than $10,000 at the .05 level. Is 
your answer consistent with part (b)? 


48. Management at a large retail appliance chain required all full-time sales staff to attend a one-day training session on improving sales technique. To evaluate the effectiveness of this rather expensive training, the number of sales in the week prior to the training and the number of sales in the week following the training were recorded for each salesperson. Data for the 10 full-time salespersons at one branch of the store appears in the accompanying table.

Salesperson               1    2    3    4    5    6    7    8    9   10
After sales training     44   53   30   41   53   63   55   68   35   41
Before sales training    50   45   25   40   45   55   40   54   33   49
Difference               −6    8    5    1    8    8   15   14    2   −8

Does the data provide convincing statistical evidence that, on average, employees make a greater number of weekly sales after the training? Test the appropriate hypotheses at the .01 level. Validate any necessary conditions.


49. The article “Less Is Better: When Low-value Options Are Valued More Highly
than High-value Options” (J. Behav. Decis. 
Making 1998: 107-121) describes several 
experiments pertaining to consumer 
behavior. In one experiment, 46 students 
were split into two groups: 23 who were 
shown 7 oz of ice cream in a 5-oz cup (the 
cup was overflowing) and 23 who were 
shown 8 oz of ice cream in a 10-oz cup
(there was a lot of empty space left over). 
Each student was then asked, “What is the 
most you are willing to pay for a serving?” 
The researchers theorized that students 
would pay more, on average, for the 
overflowing cup even though it contained 
less ice cream. 


a. Which is the correct method of analysis for this situation: the paired t test, or the two-sample t procedure from the previous section? Why?

b. The sample averages for the 7 oz and 
8 oz groups were $2.26 and $1.66, 
respectively; information in the article 
suggests the corresponding standard 
deviations are $0.84 and $0.81, respec- 
tively. Test the researchers’ hypothesis 
at the a = .05 level. Indicate any 
assumptions required for your method 
to be valid. 

c. In a second experiment, a different 
group of 23 students were shown both 
of the aforementioned ice cream cups, 
side by side. Each student then indicated 
how much s/he was willing to pay for 
each ice cream serving. Which is the 
correct method of analysis for this second experiment: the paired t test, or the two-sample t procedure from the previous section? Why?

d. It is hypothesized that under this con- 
dition, students will be willing to pay 
more for the 8 oz serving. Test this 
hypothesis at the .05 level using the 
following information: sample average 
for 7 oz serving = $1.56; sample aver- 
age for 8 oz serving = $1.85; standard 
deviation of sample differences = $0.32. 
Indicate any assumptions required for 
your method to be valid. 

e. Can you explain why the two experi- 
ments gave “opposite” results? [Hint: 
This is not a statistics question.] 


50. The article discussed in the previous exercise
also describes an experiment in which 35 
students were asked to price two boxes of 
silverware: a 24-piece box with all 24 pieces 
intact, and a 40-piece box with only 31 pie- 
ces of silverware intact. Each student indi- 
cated the amount s/he would be willing to 
pay for each box. The sample average 
amount students were willing to pay for the 
24-piece and 40-piece boxes were $29.70 
and $32.03, respectively. The standard 
deviation of the differences was $6.41. Test 
the hypothesis that, on average, students are 
willing to pay more for the box with more 


silverware even though it was not 
completely intact. Use a .05 level of 
significance. 


51. Chapter 1 Exercise 81 describes a study of children’s private speech (talking to themselves). The 33 children were each observed in about 100 ten-second intervals in the first grade, and again in the second and third grades. Because private speech occurs more in challenging circumstances, the children were observed while doing their mathematics. The speech was classified as on task (about the math lesson), off task, or mumbling (the observer could not tell what was said). Here are the 33 first-grade mumble scores, followed by the third-grade scores:

20.8  24.4  19.4  33.3  26.0  56.6  39.5  24.7  21.6
19.2  43.0  26.3  22.7  49.4  35.4  56.8  45.4  28.7
34.0  26.9  48.4  27.6  52.6   5.9  38.5  22.1  22.2
32.1  48.1  19.5  42.2  20.3  20.0

28.8  57.0  23.9  46.9  50.0  64.6  54.2  55.3  21.4
44.3  11.7  58.6  76.1  76.4  48.6  37.2  69.8  29.1
46.5  50.0  69.6  69.8  59.4  22.7  84.9  42.0  67.2
38.3  78.5  38.1  60.4  57.8  38.7


The numbers are in the same order for each grade; for example, the third student mumbled in 19.4% of the intervals in the first grade and 23.9% of the intervals in the third grade.


a. Verify graphically that normality is 
plausible for the population distribution 
of differences. 

b. Find a 95% confidence interval for the 
difference of population means, and 
interpret the result. 


52. Can people operate touch screen devices
more quickly with their index finger or their 
thumb? Holding the device in landscape or 
portrait position? The article “Evaluation of 
a Psychomotor Vigilance Task (PVT) for 
Touch Screen Devices” (Hum. Factors 
2017: 661-670) describes a study in which 
13 participants performed a series of tasks 
on an iPod holding it two different ways: 
(1) in portrait position using their index 
finger and (2) in landscape position using 
their thumb. The median response time, in 
milliseconds, was recorded for each participant under both settings. Those observations are summarized in the accompanying table.

iPod position        Mean      SD
Portrait/index      224.09   39.30
Landscape/thumb     211.50   30.74
Difference           12.59   15.28


Test whether the results under the two 
positions are significantly different at the 
.05 significance level. 


53. It has been estimated that between 1945
and 1971, as many as 2 million children 
were born to mothers treated with diethyl- 
stilbestrol (DES), a nonsteroidal estrogen 
recommended for pregnancy maintenance. 
The FDA banned this drug in 1971 because 
research indicated a link with the incidence 
of cervical cancer. The article “Effects of 
Prenatal Exposure to  Diethylstilbestrol 
(DES) on Hemispheric Laterality and Spa- 
tial Ability in Human Males” (Hormones 
Behav. 1992: 62—75) discussed a study in 
which 10 males exposed to DES and their 
unexposed brothers underwent various 
tests. This is the summary data on the 
results of a spatial ability test: $\bar{x} = 12.6$
(exposed), $\bar{y} = 13.7$, and standard error of
mean difference = .5. Test at level .05 to 
see whether exposure is associated with 
reduced spatial ability by obtaining the 
P-value. 

54. As integrated circuits operate at ever-smaller resolutions, the clean handling of wafers in the manufacturing process has become even more important. The article “Particle Free Handling of Substrates” (IEEE Trans. Semicond. Manuf. 2016: 314-319) provides the following data on the number of particles detected pre- and post-handling for a sample of 16 wafers:

Pre       5     5    32    14     2     2    17    13
Post    236   684  1256  3605    40    92   173    44
Diff.   231   679  1224  3591    38    90   156    31

Pre      18    27     3    18    52    20    17    35
Post     51    88   189   610   124  1218  2023  3057
Diff.    33    61   184   592    72  1198  2006  3022

a. The researchers desired a confidence interval for $\mu_D$, the average increase in number of particles per wafer due to handling. Why should the paired t interval not be applied here? [Hint: Construct a normal probability plot of the differences.]

b. A normal probability plot of the logarithms of the difference values shows that the population of ln(D) values is plausibly normal (i.e., D may be lognormal). Take the logarithm of the differences, and use those values to construct a 95% CI for E[ln(D)].

c. It can be shown that exponentiating the endpoints of the interval from part (b) produces a confidence interval not for $\mu_D$, but rather for the population median $\tilde{\mu}_D$. Exponentiate the limits of the interval from part (b), and interpret this interval.


55. Construct a paired data set for which $t = \infty$, so that the data is highly significant when the correct analysis is used, yet t for the two-sample t test is quite near zero, so the incorrect analysis yields an insignificant result.
10.4 Inferences About Two Population Proportions


Having presented methods for comparing the means of two different populations, we now turn to the comparison of two population proportions. The notation for this scenario is an extension of the notation used in the corresponding one-population problem. Let $p_1$ and $p_2$ denote the proportions of individuals in populations 1 and 2, respectively, who possess a particular characteristic. Equivalently, if we use the label S (success) for an individual who possesses the characteristic of interest (favors a particular proposition, has read at least one book within the last month, etc.), then $p_1$ and $p_2$ represent the probabilities of seeing the label S on a randomly chosen individual from populations 1 and 2, respectively.

We will assume the availability of a sample of m individuals from the first population and n from the second. The variables X and Y will represent the number of individuals in each sample possessing the characteristic that defines $p_1$ and $p_2$. Provided the population sizes are much larger than the sample sizes, the distribution of X can be taken to be binomial with parameters m and $p_1$, and similarly, $Y \sim \mathrm{Bin}(n, p_2)$. Furthermore, the samples are assumed to be independent of each other, so that X and Y are independent rvs.

The obvious estimator for $p_1 - p_2$, the difference in population proportions, is the corresponding difference in sample proportions. With $\hat{P}_1 = X/m$ and $\hat{P}_2 = Y/n$, the estimator of $p_1 - p_2$ can be expressed as $\hat{P}_1 - \hat{P}_2 = X/m - Y/n$.


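PROPOSITION Let $X \sim \text{Bin}(m, p_1)$ and $Y \sim \text{Bin}(n, p_2)$ with X and Y independent variables. 
Define $\hat{P}_1 = X/m$ and $\hat{P}_2 = Y/n$. Then 

$$E(\hat{P}_1 - \hat{P}_2) = p_1 - p_2,$$

so $\hat{P}_1 - \hat{P}_2$ is an unbiased estimator of $p_1 - p_2$, and 

$$V(\hat{P}_1 - \hat{P}_2) = \frac{p_1 q_1}{m} + \frac{p_2 q_2}{n} \qquad (\text{where } q_i = 1 - p_i)$$


Proof Since $E(X) = mp_1$ and $E(Y) = np_2$, 

$$E\left(\frac{X}{m} - \frac{Y}{n}\right) = \frac{1}{m}E(X) - \frac{1}{n}E(Y) = \frac{1}{m}mp_1 - \frac{1}{n}np_2 = p_1 - p_2$$

Since $V(X) = mp_1q_1$, $V(Y) = np_2q_2$, and X and Y are independent, 

$$V\left(\frac{X}{m} - \frac{Y}{n}\right) = V\left(\frac{X}{m}\right) + (-1)^2 V\left(\frac{Y}{n}\right) = \frac{1}{m^2}V(X) + \frac{1}{n^2}V(Y) = \frac{p_1 q_1}{m} + \frac{p_2 q_2}{n}$$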
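
For small sample sizes the proposition can be checked numerically by summing over the joint pmf of X and Y. The following is a minimal sketch of such a check; the function names and the values m = 12, $p_1$ = .30, n = 20, $p_2$ = .45 are hypothetical choices made only for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_moments_of_difference(m, p1, n, p2):
    """Exact mean and variance of X/m - Y/n for independent binomial X and Y."""
    mean = 0.0
    second_moment = 0.0
    for x in range(m + 1):
        px = binom_pmf(x, m, p1)
        for y in range(n + 1):
            prob = px * binom_pmf(y, n, p2)  # independence: joint pmf is a product
            d = x / m - y / n
            mean += prob * d
            second_moment += prob * d * d
    return mean, second_moment - mean**2


# Hypothetical parameter values, chosen only for illustration.
m, p1, n, p2 = 12, 0.30, 20, 0.45
mean, var = exact_moments_of_difference(m, p1, n, p2)
formula_var = p1 * (1 - p1) / m + p2 * (1 - p2) / n
print(mean, p1 - p2)       # the two values should agree
print(var, formula_var)    # likewise
```

Enumeration is feasible here only because m and n are small; it confirms the formulas but is no substitute for the proof.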


We will focus first on situations in which both m and n are large. Then because $\hat{P}_1$ and $\hat{P}_2$ individually 
have approximately normal distributions, the estimator $\hat{P}_1 - \hat{P}_2$ also has approximately a normal 
distribution. Standardizing $\hat{P}_1 - \hat{P}_2$ yields a variable Z whose distribution is approximately standard 
normal: 

$$Z = \frac{\hat{P}_1 - \hat{P}_2 - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{m} + \dfrac{p_2 q_2}{n}}}$$


A Large-Sample Test Procedure 

Analogous to the hypotheses for $\mu_1 - \mu_2$, the most general null hypothesis an investigator might 
consider would be of the form $H_0: p_1 - p_2 = \Delta_0$, where $\Delta_0$ is again a specified number. Although for 
population means the case $\Delta_0 \ne 0$ presented no difficulties, for population proportions the cases $\Delta_0 = 0$ 
and $\Delta_0 \ne 0$ must be considered separately. Since the vast majority of actual problems of this sort involve 
$\Delta_0 = 0$ (i.e., the null hypothesis $p_1 = p_2$), we will concentrate on this case. When $H_0: p_1 - p_2 = 0$ is true, 
let p denote the common value of $p_1$ and $p_2$ (and similarly for q). Then the standardized variable 

$$Z = \frac{\hat{P}_1 - \hat{P}_2 - 0}{\sqrt{pq\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

has approximately a standard normal distribution when $H_0$ is true. However, this Z cannot serve as a 
test statistic because the value of p is unknown: $H_0$ asserts only that there is a common value of p, 
but $H_0$ does not say what that value is. To obtain a usable test statistic having approximately a 
standard normal distribution when $H_0$ is true, p must be estimated from the sample data. 

Assuming then that $p_1 = p_2 = p$, instead of separate samples of size m and n from two different 
populations (two different binomial distributions), we really have a single sample of size m + n from 
one population with proportion p. Since the total number of individuals in this combined sample 
having the characteristic of interest is X + Y, the estimator of p is 

$$\hat{P} = \frac{X + Y}{m + n} = \frac{m}{m + n}\hat{P}_1 + \frac{n}{m + n}\hat{P}_2 \qquad (10.5)$$

The second expression for $\hat{P}$ shows that it is actually a weighted average of the estimators $\hat{P}_1$ and $\hat{P}_2$ 
obtained from the two samples. If we take (10.5) and substitute back into Z with $\hat{Q} = 1 - \hat{P}$, the 
resulting statistic has approximately a N(0, 1) distribution when $H_0$ is true. 


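TWO-PROPORTION z TEST 

Null hypothesis: $H_0: p_1 - p_2 = 0$ 

Test statistic value (large samples): $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$ 

Alternative Hypothesis        Rejection Region for Approximate Level $\alpha$ Test 

$H_a: p_1 - p_2 > 0$          $z \ge z_\alpha$ 
$H_a: p_1 - p_2 < 0$          $z \le -z_\alpha$ 
$H_a: p_1 - p_2 \ne 0$        either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$ 

A P-value is calculated in the same way as for previous z tests. 
These procedures are valid provided that $m\hat{p}_1 \ge 10$, $m\hat{q}_1 \ge 10$, $n\hat{p}_2 \ge 10$, 
and $n\hat{q}_2 \ge 10$. 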


Example 10.12 Are customers more willing to buy a product or service just because they’re offered 
multiple purchase plans? In a study published in Land Econ. (2006), home-owning residents of 
Madison, WI were offered the chance to buy wind-generated electricity for their homes as part of a 
pilot program in the city. A random sample of 237 residents was offered seven different purchase 
plans (ranging from 50 kWh to 600 kWh per month; the extra monthly cost of wind-generated power 
was about $1 for 25 kWh). An independent random sample of 649 residents was offered just a single 
choice for the amount they could purchase, with that amount randomly selected for each resident from 
among the same seven options. Thirty-seven percent of those offered multiple options purchased 
wind-generated electricity, compared to 24% of the single-option group. 

Does this data suggest that customers are more willing to buy wind-generated electricity when 
they’re offered multiple purchase plans, or could the disparity be attributed to chance? Test at the 
$\alpha = .01$ level. 


1. The population being studied consists of all homeowners in Madison, WI. Within this population, 
the parameter of interest is $p_1 - p_2$, the difference in the proportions who would buy 
wind-generated electricity under the multiple-option scheme and the single-option scheme. 

2. Researchers believed a priori that customers were more likely to buy in under the first scheme, 
so the competing hypotheses are 

$H_0: p_1 - p_2 = 0$ (i.e., $p_1 = p_2$) 
$H_a: p_1 - p_2 > 0$ (i.e., $p_1 > p_2$) 

3. The data consists of two independent random samples. From the numbers provided, 
$m\hat{p}_1 = (237)(.37) \approx 88$, $m\hat{q}_1 \approx 149$, $n\hat{p}_2 = (649)(.24) \approx 156$, and $n\hat{q}_2 \approx 493$; all of these 
values are at least 10. The requirements for this large-sample z hypothesis test are satisfied. 

4. The two-proportion z test statistic value is $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$. 

5. For this upper-tailed test, we reject $H_0$ if $z \ge z_{.01} = 2.33$. 

6. The combined (pooled) estimate of the common proportion p under $H_0$ is 

$$\hat{p} = \frac{237}{237 + 649}(.37) + \frac{649}{237 + 649}(.24) = .275$$

which results in a test statistic value of 

$$z = \frac{(.37) - (.24)}{\sqrt{(.275)(.725)\left(\dfrac{1}{237} + \dfrac{1}{649}\right)}} = 3.84$$

That is, the observed value of $\hat{P}_1 - \hat{P}_2$ is almost four standard deviations larger than what we’d 
expect if $H_0$ were true. 

7. Since 3.84 > 2.33, $H_0$ is rejected at the .01 significance level. The data very strongly suggests 
that Madison homeowners are more likely to purchase wind-generated electricity if they are 
offered several options for the amount of electricity they can buy. 

Using a P-value approach, based on the direction of $H_a$, the P-value equals $1 - \Phi(3.84) \approx .00006$. If 
residents were equally likely to buy wind-generated electricity under both schemes, the chance of 
observing a disparity at least as large as the one in this study would be extremely small. Hence, again, 
we would reject $H_0$ in favor of the alternative hypothesis. 
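
The arithmetic of Example 10.12 can be reproduced in a few lines. The sketch below uses only the Python standard library; the function name `two_proportion_z` is our own choice for illustration, and the inputs are the rounded sample proportions quoted in the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1_hat, m, p2_hat, n):
    """Pooled two-proportion z statistic for H0: p1 - p2 = 0.

    Returns the statistic together with the pooled estimate of p.
    """
    pooled = (m * p1_hat + n * p2_hat) / (m + n)      # weighted average, as in (10.5)
    se = sqrt(pooled * (1 - pooled) * (1 / m + 1 / n))
    return (p1_hat - p2_hat) / se, pooled


# Values from Example 10.12: 37% of 237 residents versus 24% of 649 residents.
z, pooled = two_proportion_z(0.37, 237, 0.24, 649)
p_value = 1 - NormalDist().cdf(z)                     # upper-tailed test
print(round(pooled, 3), round(z, 2), p_value)
```

Because the alternative here is upper-tailed, the P-value is the area to the right of z; for a two-tailed alternative one would instead double the area beyond |z|.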
Power, $\beta$, and Sample Sizes 

Here the determination of power and $\beta$ is a bit more cumbersome than it was for other large-sample 
tests. The reason is that the denominator of Z is an estimate of the standard deviation of $\hat{P}_1 - \hat{P}_2$ 
assuming that $p_1 = p_2 = p$. When $H_0$ is false, $\hat{P}_1 - \hat{P}_2$ must be re-standardized using 

$$\sigma = \sigma_{\hat{P}_1 - \hat{P}_2} = \sqrt{\frac{p_1 q_1}{m} + \frac{p_2 q_2}{n}} \qquad (10.6)$$

The form of $\sigma$ in (10.6) implies that power and $\beta$ are functions of both $p_1$ and $p_2$, not just the 
difference $p_1 - p_2$. So we denote the chance of a type II error by $\beta(p_1, p_2)$. 


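Alternative Hypothesis        $\beta(p_1, p_2)$ 

$H_a: p_1 - p_2 > 0$          $\Phi\left(\dfrac{z_\alpha\sqrt{\bar{p}\bar{q}\left(\frac{1}{m} + \frac{1}{n}\right)} - (p_1 - p_2)}{\sigma}\right)$ 

$H_a: p_1 - p_2 < 0$          $1 - \Phi\left(\dfrac{-z_\alpha\sqrt{\bar{p}\bar{q}\left(\frac{1}{m} + \frac{1}{n}\right)} - (p_1 - p_2)}{\sigma}\right)$ 

$H_a: p_1 - p_2 \ne 0$        $\Phi\left(\dfrac{z_{\alpha/2}\sqrt{\bar{p}\bar{q}\left(\frac{1}{m} + \frac{1}{n}\right)} - (p_1 - p_2)}{\sigma}\right) - \Phi\left(\dfrac{-z_{\alpha/2}\sqrt{\bar{p}\bar{q}\left(\frac{1}{m} + \frac{1}{n}\right)} - (p_1 - p_2)}{\sigma}\right)$ 

where $\bar{p} = (mp_1 + np_2)/(m + n)$, $\bar{q} = (mq_1 + nq_2)/(m + n)$, and $\sigma$ is given by (10.6). 
For each case, power = $1 - \beta$. 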


Alternatively, for specified $p_1$ and $p_2$, the sample sizes necessary to achieve $\beta(p_1, p_2) = \beta$ can be 
determined. For example, for the upper-tailed test, we equate $-z_\beta$ to the argument of $\Phi(\cdot)$ (i.e., what’s 
inside the parentheses) in the foregoing box. If m = n, there is a simple expression for the common value: 

$$n = \frac{\left[z_\alpha\sqrt{(p_1 + p_2)(q_1 + q_2)/2} + z_\beta\sqrt{p_1 q_1 + p_2 q_2}\right]^2}{(p_1 - p_2)^2} \qquad (10.7)$$

for an upper- or lower-tailed test, with $\alpha/2$ replacing $\alpha$ for a two-tailed test. 


Example 10.13 One of the truly impressive applications of statistics occurred in connection with the 
design of the 1954 Salk polio vaccine experiment and analysis of the resulting data. Part of the 
experiment focused on the efficacy of the vaccine in combating paralytic polio. Because it was 
thought that without a control group of children, there would be no sound basis for assessment of the 
vaccine, it was decided to administer the vaccine to one group and a placebo injection (visually 
indistinguishable from the vaccine but known to have no effect) to a control group. For ethical reasons 
and also because it was thought that the knowledge of vaccine administration might have an effect on 
treatment and diagnosis, the experiment was conducted in a double-blind manner. That is, neither the 
individuals receiving injections nor those administering them actually knew who was receiving 
vaccine and who was receiving the placebo (samples were numerically coded). Remember, at that 
point it was not at all clear whether the vaccine was beneficial. 

Let $p_1$ and $p_2$ be the probabilities of a child getting paralytic polio for the control and treatment 
conditions, respectively. The objective was to test $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 > 0$ (the 
alternative hypothesis states that a vaccinated child is less likely to contract polio than an unvaccinated 
child). Supposing the true value of $p_1$ is .0003 (an incidence rate of 30 per 100,000), the vaccine 
would be a significant improvement if the incidence rate was halved, that is, $p_2 = .00015$. Using a 
level $\alpha = .05$ test, it would then be reasonable to ask for sample sizes for which power = 90% (i.e., 
$\beta = .1$) when $p_1 = .0003$ and $p_2 = .00015$. Assuming equal sample sizes, the required n is obtained 
from (10.7) as 

$$n = \frac{\left[1.645\sqrt{(.5)(.00045)(1.99955)} + 1.28\sqrt{(.00015)(.99985) + (.0003)(.9997)}\right]^2}{(.0003 - .00015)^2}$$

$$= [(.0349 + .0271)/.00015]^2 \approx 171{,}000$$


The actual data for this experiment follows. Sample sizes of approximately 200,000 were used. The 
reader can easily verify that z = 6.43, a highly significant value. The vaccine was judged a resounding 
success! 


Placebo: m = 201,229    x = number of cases of paralytic polio = 110 
Vaccine: n = 200,745    y = 33 
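
The type II error formula for the upper-tailed test and the sample size expression (10.7) translate directly into code. The sketch below is one way to do so with the Python standard library; the function names are hypothetical, and exact normal quantiles are used in place of the rounded values 1.645 and 1.28, so the computed n differs slightly from the hand calculation while still being roughly 171,000.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def beta_upper(p1, p2, m, n, alpha):
    """beta(p1, p2) for the upper-tailed test of H0: p1 - p2 = 0."""
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (m * p1 + n * p2) / (m + n)
    q_bar = (m * q1 + n * q2) / (m + n)
    sigma = sqrt(p1 * q1 / m + p2 * q2 / n)            # equation (10.6)
    z_alpha = STD.inv_cdf(1 - alpha)
    arg = (z_alpha * sqrt(p_bar * q_bar * (1 / m + 1 / n)) - (p1 - p2)) / sigma
    return STD.cdf(arg)


def common_sample_size(p1, p2, alpha, beta):
    """Common sample size m = n from equation (10.7), one-tailed test."""
    q1, q2 = 1 - p1, 1 - p2
    z_alpha = STD.inv_cdf(1 - alpha)
    z_beta = STD.inv_cdf(1 - beta)
    root_sum = (z_alpha * sqrt((p1 + p2) * (q1 + q2) / 2)
                + z_beta * sqrt(p1 * q1 + p2 * q2))
    return root_sum**2 / (p1 - p2) ** 2


def pooled_z(x, m, y, n):
    """Two-proportion z statistic computed from success counts."""
    p_hat = (x + y) / (m + n)
    se = sqrt(p_hat * (1 - p_hat) * (1 / m + 1 / n))
    return (x / m - y / n) / se


# Design values from Example 10.13.
n_required = common_sample_size(0.0003, 0.00015, alpha=0.05, beta=0.10)
n_int = ceil(n_required)
beta_check = beta_upper(0.0003, 0.00015, n_int, n_int, alpha=0.05)

# Observed data from the same example.
z_polio = pooled_z(110, 201229, 33, 200745)
print(n_int, round(beta_check, 3), round(z_polio, 2))
```

Feeding the computed n back into `beta_upper` is a useful consistency check: the resulting type II error probability should come out at about the target value .10.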


A Large-Sample Confidence Interval for $p_1 - p_2$ 

As with means, many two-sample problems involve the objective of comparison through hypothesis 
testing, but sometimes an interval estimate for $p_1 - p_2$ is appropriate. Both $\hat{P}_1 = X/m$ and $\hat{P}_2 = Y/n$ 
have approximate normal distributions when m and n are both large. If we identify $\theta$ with $p_1 - p_2$, 
then $\hat{\theta} = \hat{P}_1 - \hat{P}_2$ satisfies the conditions necessary for obtaining a large-sample CI (see Section 9.6). 
In particular, the estimated standard deviation of $\hat{\theta}$ is $\sqrt{(\hat{p}_1\hat{q}_1/m) + (\hat{p}_2\hat{q}_2/n)}$. The $100(1 - \alpha)\%$ 
interval $\hat{\theta} \pm z_{\alpha/2} \cdot \hat{\sigma}_{\hat{\theta}}$ then becomes the two-proportion z interval 

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{m} + \frac{\hat{p}_2\hat{q}_2}{n}}$$

Like the two-proportion z test, this formula is suitable for large samples. Notice that the estimated 
standard deviation of $\hat{p}_1 - \hat{p}_2$ (the square root expression) is different here from what it was for 
hypothesis testing when $\Delta_0 = 0$. 

Statistical research has shown that the actual confidence level for the two-proportion z CI can 
sometimes deviate substantially from the nominal level (the level you think you are getting when you 
use a particular z critical value, e.g., 95% when $z_{\alpha/2} = 1.96$). A suggested improvement is to add one 
success and one failure to each of the two samples and then replace the $\hat{p}$’s and $\hat{q}$’s in the foregoing 
formula by $\tilde{p}$’s and $\tilde{q}$’s where $\tilde{p}_1 = (x + 1)/(m + 2)$, etc. This adjusted interval can also be used 
reliably when sample sizes are quite small. 
Example 10.14 Agritourism (visiting farms and participating in farm activities) is an increasingly 
large part of the American tourism industry. The authors of “Examination of the Use of E-Marketing 
by Small Farms in the Northeast,” (J. Food Distrib. Res. 2006 37(1)) investigated the relationship 
between agritourism and presence on the Internet. In a survey of 640 farms in the northeastern United 
States, 261 farms had a Web site for their farm business and 379 did not. Among farms with a Web 
site, 167 had some type of agritourism activities, compared to 152 of the farms that did not have a 
business Web site. 

Let $p_1$ = the proportion of all northeastern farms with Web sites that provide agritourism activities 
and $p_2$ = the proportion of all northeastern farms without Web sites that provide agritourism activities. 
With $\hat{p}_1 = 167/261 = .640$ and $\hat{p}_2 = 152/379 = .401$, a 95% confidence interval for $p_1 - p_2$ is 

$$(.640 - .401) \pm 1.96\sqrt{\frac{.640(.360)}{261} + \frac{.401(.599)}{379}} = .239 \pm .076 = (.163, .315)$$

We are 95% confident that the difference in the proportion of northeastern farms with Web sites that 
offer agritourism activities and the proportion of northeastern farms without Web sites that offer 
agritourism activities is between .163 and .315. (Using $\tilde{p}_1 = 168/263$ and $\tilde{p}_2 = 153/381$ based on 
sample sizes of 263 and 381, respectively, the adjusted interval here is essentially identical to the 
original interval.) 

In particular, the agritourism rate is much higher among those farms which use a business Web site 
to advertise. We observe here a positive association between having a Web site and providing 
agritourism activities. One caveat: since this is only an observational study, we cannot conclude that 
presence on the Internet causes farms to make money off agritourism. 
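
Both the traditional interval and the adjusted ("add one success and one failure to each sample") interval are easy to compute side by side. The following is a minimal sketch using the counts of Example 10.14; the function name and its `adjusted` flag are our own hypothetical conventions, not part of any particular software package.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x, m, y, n, conf=0.95, adjusted=False):
    """Large-sample CI for p1 - p2 from success counts x (of m) and y (of n).

    With adjusted=True, one success and one failure are added to each
    sample before the usual formula is applied.
    """
    if adjusted:
        x, m, y, n = x + 1, m + 2, y + 1, n + 2
    p1, p2 = x / m, y / n
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z_crit * sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    diff = p1 - p2
    return diff - half_width, diff + half_width


# Example 10.14: 167 of 261 farms with a Web site, 152 of 379 without.
plain = two_proportion_ci(167, 261, 152, 379)
adj = two_proportion_ci(167, 261, 152, 379, adjusted=True)
print([round(v, 3) for v in plain])
print([round(v, 3) for v in adj])
```

Since the code carries unrounded sample proportions, its endpoints can differ from the hand calculation in the third decimal place. With samples this large the two intervals nearly coincide; the adjustment matters much more when m or n is small.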


Small-Sample Inferences 

On occasion an inference concerning p, — p2 may have to be based on samples for which at least one 
sample size is small. Appropriate methods for such situations are not as straightforward as those for 
large samples, and there is less agreement among statisticians as to recommended procedures. 

The main issue here is that $\hat{P}_1 - \hat{P}_2$ is no longer approximately normal when m or n is small, and 
no expression exists for the exact distribution of the difference of two (scaled) binomial rvs. Some 
software packages will “adjust” the data by adding one success and one failure to each sample, as was 
mentioned briefly in the context of confidence intervals. Alternatively, statistical software can be used 
to simulate the sampling distribution of $\hat{P}_1 - \hat{P}_2$ using the underlying binomial models, and P-values 
can be estimated therefrom. 

One frequently used test in this situation, called Fisher’s exact test, is based on the hypergeometric 
distribution. This method has its own deficiencies, as it assumes that both the sample sizes and the 
total number of successes across the two samples are fixed. But Fisher’s exact test is the most 
commonly used alternative procedure for comparing two proportions from small samples. Please 
consult an appropriate reference for more information. 
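
To make the hypergeometric idea concrete, here is a minimal sketch of the one-sided version of Fisher’s exact test. Conditional on the total number of successes x + y, the count X from the first sample has a hypergeometric distribution under $H_0: p_1 = p_2$, and the P-value for $H_a: p_1 > p_2$ is the probability of a count at least as large as the one observed. The function name and the small data set below are hypothetical.

```python
from math import comb


def fisher_exact_upper(x, m, y, n):
    """One-sided Fisher P-value P(X >= x) for Ha: p1 > p2.

    x successes out of m in sample 1, y successes out of n in sample 2;
    the total number of successes x + y is treated as fixed.
    """
    total_successes = x + y
    denom = comb(m + n, total_successes)
    largest = min(m, total_successes)
    p_value = 0.0
    for k in range(x, largest + 1):
        # hypergeometric pmf: k successes land in sample 1, the rest in sample 2
        p_value += comb(m, k) * comb(n, total_successes - k) / denom
    return p_value


# Hypothetical small samples: 3 successes out of 4 versus 1 success out of 4.
p_val = fisher_exact_upper(3, 4, 1, 4)
print(p_val)  # (16 + 1)/70
```

A two-sided version requires a convention for which tables count as "at least as extreme," and software packages differ on this point, so only the one-sided case is sketched here.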
Exercises: Section 10.4 (56–70) 


56. Independent random samples of 237 African-Americans and 396 Caucasian-Americans were asked to 
identify their “favorite” category of television programming (“Television Types and TV Attitudes 
of African-Americans, Latinos, and Caucasians,” J. Advert. Res. 2008: 235-246). In the survey, 
13.1% of African-Americans and 18.7% of Caucasians indicated that they prefer to watch the news 
more than any other category. 

a. Test the hypothesis that the proportion of all people whose favorite TV viewing category is 
news differs between the populations of all African-Americans and Caucasians, at the 
$\alpha = .01$ significance level. 

b. Would your answer to part (a) be different if you used a .10 significance level? Explain. 

c. The same survey found that 33.3% of 213 randomly selected Latinos chose the news as their 
favorite television program. Repeat part (a), but compare the populations of Latinos and 
African-Americans. 

57. A sample of 300 urban adult residents of a particular state revealed 63 who favored increasing 
the highway speed limit from 55 to 65 mph, whereas a sample of 180 rural residents yielded 75 who 
favored the increase. Does this data indicate that the sentiment for increasing the speed limit is 
different for the two groups of residents? 

a. Test $H_0: p_1 = p_2$ versus $H_a: p_1 \ne p_2$ using $\alpha = .05$, where $p_1$ refers to the urban 
population. 

b. If the true proportions favoring the increase are actually $p_1 = .20$ (urban) and $p_2 = .40$ 
(rural), what is the probability that $H_0$ will be rejected using a level .05 test with m = 300, 
n = 180? 

58. It is thought that the front cover and the nature of the first question on mail surveys 
influence the response rate. The article “The Impact of Cover Design and First Questions on 
Response Rates for a Mail Survey of Skydivers” (Leisure Sci. 1991: 67-76) tested this theory by 
experimenting with different cover designs. One cover was plain; the other used a picture of a 
skydiver. The researchers speculated that the return rate would be lower for the plain cover. 

Cover       Number sent    Number returned 
Plain       207            104 
Skydiver    213            109 

Does this data support the researchers’ hypothesis? Test the relevant hypotheses using 
$\alpha = .10$ by first calculating a P-value. 

59. Do teachers find their work rewarding and satisfying? The article “Work-Related Attitudes” 
(Psych. Rep. 1991: 443-450) reports the results of a survey of 395 elementary school teachers and 
266 high school teachers. Of the elementary school teachers, 224 said they were very satisfied 
with their jobs, whereas 126 of the high school teachers were very satisfied with their work. 
Estimate the difference between the proportion of all elementary school teachers who are 
satisfied and all high school teachers who are satisfied by calculating a CI. 

60. Several states have an annual “sales tax holiday” to encourage spending. A survey of 695 
randomly selected shoppers at a large retail center in Texas asked how important people felt the 
tax holiday was (Am. J. Bus. 2007). 565 shoppers indicated that the tax holiday was important in 
their decision to shop. 

a. Estimate the proportion of all Texas shoppers for whom the tax holiday is important in their 
decision to shop. 

b. In the same study, 195 of 250 men said the tax holiday was important in their decision to shop, 
compared to 370 of 445 women. Test the hypothesis that women are more likely than men to 
consider the tax holiday important, at the 5% significance level. 


61. The author of the article “Food and Eating on Television: Impacts and Influences” 
(Nutrition Food Sci. 2000: 24-29) examined hundreds of hours of BBC television footage and 
categorized food images for both TV programs and commercials. Out of 1785 food images in TV 
programs, 322 showed sugary and/or fatty foods, while 511 out of 1186 commercial food images 
were sugary and/or fatty. 

a. Construct a 99% CI for the difference in the proportion of food images that include 
sugary/fatty foods in TV programs and in commercials. Assume these two samples are 
representative of all food images on the BBC. 

b. What does the CI in part (a) say about the disparity between food images in TV programs and 
those in commercials? 

62. The authors of the article “Adjuvant Radiotherapy and Chemotherapy in Node-Positive 
Premenopausal Women with Breast Cancer” (New Engl. J. Med. 1997: 956-962) reported on the 
results of an experiment designed to compare treating cancer patients with only chemotherapy to 
treatment with a combination of chemotherapy and radiation. Of the 154 individuals who received 
the chemotherapy-only treatment, 76 survived at least 15 years, whereas 98 of the 164 patients 
who received the hybrid treatment survived at least that long. 

a. With $p_1$ denoting the proportion of all such women who, when treated with just chemotherapy, 
survive at least 15 years and $p_2$ denoting the analogous proportion for the hybrid treatment, 
calculate a 99% CI for $p_1 - p_2$. 

b. Based on the interval from part (a), can either treatment be judged superior to the other? Why 
or why not? 

63. The article “Luck of the Draw: Creating Chinese Brand Names” (J. Advertising Res. 2008: 
523-530) explores the use of “lucky” brand names in China, and whether that use varies by the 
uncertainty in a brand’s business environment (its market sector). In a sample of 654 brands from 
sectors with low uncertainty, 372 names were considered lucky; among 548 “high-uncertainty” 
brands, 343 had lucky names. (In Chinese culture, the number of strokes required to write a name 
determines its luck.) The authors of the article theorized that companies would use “lucky” brand 
names more often in high-uncertainty business environments. Test the authors’ hypothesis at the 
$\alpha = .05$ significance level. [Hint: The two populations are all Chinese brands in 
low-uncertainty business environments and all Chinese brands in high-uncertainty business 
environments.] 

64. Air travelers often complain that recirculated air in the cabin leads to the spread of colds, 
while the airline industry generally disputes this claim. A 2002 study in the J. Am. Med. Assoc. 
reported the following information for passengers on flights with and without recirculated air: 

                                   Post-flight respiratory symptoms? 
Recirculated air on the flight?    Yes    No 
Yes                                111    472 
No                                 108    409 

Assume these represent independent random samples from the two relevant populations. Does the 
data suggest that the likelihood of catching a cold is higher on flights with recirculated air? 
Test at the $\alpha = .05$ significance level. 

65. In a 2017 class project, two team members each approached 50 students on campus (randomly 
selected using systematic sampling). Each student was asked to participate in a survey, but the 
survey itself was a ruse: the real goal was to see who would agree to be surveyed by Melissa (who 
has a British accent) or Kristine (an American accent). In the end, 41 of 50 students agreed to 
be surveyed by Melissa, while 27 of 50 took Kristine’s (fake) survey. 

a. Test the hypothesis of equal population proportions at the .01 significance level. Find the 
P-value for the test, and interpret your results. Be sure to clearly define your parameters! 

b. Can it be concluded that there is a causal relationship between interviewer’s accent and 
willingness to be surveyed? Explain. 

66. Statin drugs are used to decrease cholesterol levels and therefore hopefully to decrease the 
chances of a heart attack. In a British study (“MRC/BHF Heart Protection Study of Cholesterol 
Lowering with Simvastatin in 20,536 High-Risk Individuals: A Randomized Placebo-Controlled Trial,” 
Lancet 2002: 7-22) 20,536 at-risk adults were assigned randomly to take either a 40-mg statin 
pill or placebo. The subjects had coronary disease, artery blockage, or diabetes. After 5 years 
there were 1328 deaths (587 from heart attack) among the 10,269 in the statin group and 1507 
deaths (707 from heart attack) among the 10,267 in the placebo group. 

a. Give a 95% confidence interval for the difference in population death proportions. 

b. Give a 95% confidence interval for the difference in population heart attack death 
proportions. 

c. Is it reasonable to say that most of the difference in death proportions is due to heart 
attacks, as would be expected? 


67. Using the traditional formula, a 95% CI for $p_1 - p_2$ is to be constructed based on equal 
sample sizes from the two populations. For what value of n (= m) will the resulting interval have 
width at most .1 irrespective of the results of the sampling? 

68. In medical investigations, the ratio $\theta = p_1/p_2$ is often of more interest than the 
difference $p_1 - p_2$ (e.g., individuals given treatment 1 are how many times as likely to 
recover as those given treatment 2?). Let $\hat{\theta} = \hat{P}_1/\hat{P}_2$. When m and n are both large, 
the statistic $\ln(\hat{\theta})$ has approximately a normal distribution with approximate mean 
value $\ln(\theta)$ and approximate standard deviation $[(m - x)/(mx) + (n - y)/(ny)]^{1/2}$. 

a. Use these facts to obtain a large-sample 95% CI formula for estimating $\ln(\theta)$, and then 
a CI for $\theta$ itself. 

b. The article “Low-Dose Aspirin for Preventing Recurrent Venous Thromboembolism (VT)” 
(New Engl. J. Med. 2012: 1979-1987) reports a study in which VT recurred in 73 of 411 patients 
randomly assigned a placebo and in 57 of the 411 assigned to an aspirin regimen. Calculate an 
interval of plausible values for $\theta$ at the 95% confidence level. What does this interval 
suggest about the efficacy of the aspirin treatment? 

69. All the examples of this section featured success/failure data from two independent samples. 
McNemar’s Test handles paired binary responses. For example, suppose that before a major policy 
speech by a political candidate, n individuals are selected and asked whether (S) or not (F) they 
favor the candidate. Then after the speech the same n people are asked the same question. The 
responses can be entered in a table as follows: 

                 After 
                 S        F 
Before    S      $X_1$    $X_2$ 
          F      $X_3$    $X_4$ 

where $X_1 + X_2 + X_3 + X_4 = n$. Let $p_1$, $p_2$, $p_3$, and $p_4$ denote the four cell 
probabilities, so that $p_1$ = P(S before and S after), and so on. We wish to test the hypothesis 
that the true proportion of supporters (S) after the speech has not increased against the 
alternative that it has increased. 

a. State the two hypotheses of interest in terms of $p_1$, $p_2$, $p_3$, and $p_4$. 

b. Construct an estimator for the after/before difference in success probabilities. 

c. When n is large, it can be shown that the random variable $(X_i - X_j)/n$ has approximately a 
normal distribution with variance $[p_i + p_j - (p_i - p_j)^2]/n$. Construct a test statistic with 
approximately a standard normal distribution when $H_0$ is true. 

d. If $x_1 = 350$, $x_2 = 150$, $x_3 = 200$, and $x_4 = 300$, what do you conclude? 

70. McNemar’s test, developed in the previous exercise, can also be used when individuals are 
“matched” to yield pairs and then one member of each pair is given treatment 1 and the other is 
given treatment 2. Then $X_1$ is the number of pairs in which both treatments were successful, and 
similarly for $X_2$, $X_3$, and $X_4$. Suppose the following data is obtained from such a matched 
pairs design to assess the effectiveness of a certain migraine headache medicine. Use McNemar’s 
test to determine whether the medicine is effective in the treatment of migraines. 
Medicine 
Ss F 
Placebo 2 ie a 
F 46 30 


10.5 Inferences About Two Population Variances 


Methods for comparing two population variances (or standard deviations) are occasionally needed, 
though such problems arise much less frequently than those involving means or proportions. For the 
case in which the populations under investigation are normal, the procedures are based on the 
F distribution from Sections 6.3 and 6.4. 
Testing Hypotheses 

A test procedure for hypotheses concerning the ratio $\sigma_1^2/\sigma_2^2$, as well as a CI for this ratio, is based on 
the following result from Section 6.4. 


THEOREM Let $X_1, \ldots, X_m$ be a random sample from a normal distribution with standard deviation 
$\sigma_1$, let $Y_1, \ldots, Y_n$ be another random sample (independent of the $X_i$’s) from a normal 
distribution with standard deviation $\sigma_2$, and let $S_1$ and $S_2$ denote the two sample 
standard deviations. Then the rv 

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \qquad (10.8)$$

has an F distribution with $\nu_1 = m - 1$ and $\nu_2 = n - 1$. 


Under the null hypothesis of equal population standard deviations, (10.8) reduces to the ratio of 
sample variances. For a test statistic we use this ratio of sample variances, and the claim that $\sigma_1 = \sigma_2$ 
is rejected if the ratio differs by too much from 1. 


THE F TEST FOR EQUALITY OF VARIANCES 

Null hypothesis: $H_0: \sigma_1^2 = \sigma_2^2$ (equivalently, $\sigma_1 = \sigma_2$) 

Test statistic value: $f = s_1^2/s_2^2$ 

Alternative Hypothesis               Rejection Region for a Level $\alpha$ Test 

$H_a: \sigma_1^2 > \sigma_2^2$       $f \ge F_{\alpha,m-1,n-1}$ 
$H_a: \sigma_1^2 < \sigma_2^2$       $f \le F_{1-\alpha,m-1,n-1}$ 
$H_a: \sigma_1^2 \ne \sigma_2^2$     either $f \ge F_{\alpha/2,m-1,n-1}$ or $f \le F_{1-\alpha/2,m-1,n-1}$ 


Since critical values are tabled only for $\alpha$ = .10, .05, .01, and .001 in Appendix Table A.8, the two- 
tailed test can be performed only at levels .20, .10, .02, and .002 without statistical software. 


Example 10.15 Is there less variation in weights of some baked goods than others? Here are the 
weights (in grams) for a sample of Bruegger’s bagels and another sample of Wolferman’s English 
muffins: 


B: 99.8 105.4 94.7 107.8 114.3 106.3 
W: 99.0 98.2 98.1 102.1 102.9 104.1 98.8 99.5 


The normality assumption is very important for the use of the F test. Normal probability plots from 
Minitab are shown in Figure 10.9. There is no apparent reason to doubt normality of either population 
distribution here. 
[Figure: normal probability plots of z-score versus weight (grams), one for each brand, 
Bruegger’s and Wolferman’s] 

Figure 10.9 Normal plot for baked goods 


Notice the difference in slopes for the two sources. This suggests different variabilities because the 
z-score vertical axis is related to the horizontal axis (grams) by z = (grams − mean)/(std dev). Thus, 
when z-score is plotted against grams the slope is the reciprocal of the standard deviation. Now let’s test 
$H_0: \sigma_1 = \sigma_2$ against a two-sided alternative with $\alpha = .02$. We need the critical values $F_{.01,5,7} = 7.46$ 
and $F_{.99,5,7} = 1/F_{.01,7,5} = 1/10.46 = .0956$; here we have used the reciprocal property 

$$F_{1-\alpha,\nu_1,\nu_2} = \frac{1}{F_{\alpha,\nu_2,\nu_1}} \qquad (10.9)$$

from Section 6.3. From the sample data, $s_1 = 6.765$ for the bagels and $s_2 = 2.338$ for the muffins, so 

$$f = \frac{s_1^2}{s_2^2} = \frac{(6.765)^2}{(2.338)^2} = 8.37$$

which exceeds 7.46, so the hypothesis of equal standard deviations is rejected. We conclude that there 
is a difference in weight variation, and the English muffins are less variable. 

Notice that it is not really necessary to use the lower-tailed critical value here if the groups are 
chosen so the first group has the larger variance, and therefore the value of $f = s_1^2/s_2^2$ exceeds 1. 
Because f > 1, the only comparison is between the computed f and the upper critical value 7.46. It 
does not change the result of the test to fix things so f > 1, so it is not cheating to simplify the test in 
this way. 
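
The test statistic of Example 10.15 can be computed directly from the two samples. The sketch below uses the standard library’s `statistics.variance` (which divides by n − 1); the critical values 7.46 and 1/10.46 are simply the table values quoted in the example, since the standard library has no F quantile function.

```python
from statistics import variance

# Weights in grams from Example 10.15.
bagels = [99.8, 105.4, 94.7, 107.8, 114.3, 106.3]
muffins = [99.0, 98.2, 98.1, 102.1, 102.9, 104.1, 98.8, 99.5]

s1_sq = variance(bagels)    # sample variance, divisor m - 1
s2_sq = variance(muffins)   # sample variance, divisor n - 1
f = s1_sq / s2_sq

# Two-tailed test at level .02, using the critical values quoted in the text.
upper_critical = 7.46         # F_{.01,5,7}
lower_critical = 1 / 10.46    # F_{.99,5,7} via the reciprocal property (10.9)
reject = f >= upper_critical or f <= lower_critical
print(round(s1_sq ** 0.5, 3), round(s2_sq ** 0.5, 3), round(f, 2), reject)
```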


P-Values for F Tests 

Recall that the P-value for an upper-tailed t test is the area under the relevant t curve (the one with 
appropriate df) to the right of the calculated t. In the same way, the P-value for an upper-tailed F test 
is the area under the F curve with appropriate numerator and denominator df to the right of the 
calculated f. Figure 10.10 illustrates this for a test based on $\nu_1 = 4$ and $\nu_2 = 6$. 
[Figure: F curve for $\nu_1 = 4$, $\nu_2 = 6$; the shaded area to the right of f = 6.23 is the 
P-value = .025] 

Figure 10.10 A P-value for an upper-tailed F test 


Unfortunately, tabulation of F curve upper-tail areas is much more cumbersome than for t curves 
because two df’s are involved. For each combination of $\nu_1$ and $\nu_2$, our F table gives only the four 
critical values that capture areas .10, .05, .01, and .001. Figure 10.11 shows what can be said about 
the P-value depending on where f falls relative to the four critical values. 


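Excerpt from the F table for $\nu_1 = 4$, $\nu_2 = 6$: 

$\alpha$    Critical value 
.10         3.18 
.05         4.53 
.01         9.15 
.001        21.92 

Location of f            Information about the P-value 
f < 3.18                 P-value > .10 
3.18 < f < 4.53          .05 < P-value < .10 
4.53 < f < 9.15          .01 < P-value < .05 
9.15 < f < 21.92         .001 < P-value < .01 
f > 21.92                P-value < .001 


Figure 10.11 Obtaining P-value information from the F table for an upper-tailed F test 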


For example, for a test with $\nu_1 = 4$ and $\nu_2 = 6$, 

f = 2.16  ⇒  P-value > .10 
f = 5.70  ⇒  .01 < P-value < .05 
f = 25.03 ⇒  P-value < .001 
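
The table lookup just described is mechanical enough to automate. The following sketch brackets the P-value of an upper-tailed F test from a dictionary of tabled critical values; the function name and the dictionary layout are hypothetical, and the four critical values are those shown for $\nu_1 = 4$, $\nu_2 = 6$.

```python
def bracket_p_value(f, critical_values):
    """Bounds (lower, upper) on an upper-tailed P-value from tabled values.

    critical_values maps an upper-tail area to the critical value that
    captures it, e.g. {0.10: 3.18, ...}.
    """
    lower, upper = 0.0, 1.0
    # Walk from the largest tail area (smallest critical value) downward.
    for area, crit in sorted(critical_values.items(), reverse=True):
        if f >= crit:
            upper = area      # f is at or beyond this critical value
        else:
            lower = area      # f falls short of it, so the P-value exceeds this area
            break
    return lower, upper


# Critical values for nu1 = 4, nu2 = 6 from the F table excerpt.
table_4_6 = {0.10: 3.18, 0.05: 4.53, 0.01: 9.15, 0.001: 21.92}
for f_value in (2.16, 5.70, 25.03):
    print(f_value, bracket_p_value(f_value, table_4_6))
```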


Once we know that .01 < P-value < .05, $H_0$ would be rejected at a significance level of .05 but not at
a level of .01. When P-value < .001, $H_0$ should be rejected at any reasonable significance level.
The F tests discussed in succeeding chapters will all be upper-tailed. If, however, a lower-tailed
F test is appropriate, then (10.9) should be used to obtain lower-tailed critical values so that bounds on
the P-value can be established. In the case of a two-tailed test, the bounds from a one-tailed test should
be multiplied by 2. For example, if f = 5.82 when $\nu_1 = 4$ and $\nu_2 = 6$, then since 5.82 falls between the
.05 and .01 critical values, 2(.01) < P-value < 2(.05), giving .02 < P-value < .10. $H_0$ would then be
rejected if $\alpha = .10$ but not if $\alpha = .01$. In this case, we cannot say from our table what conclusion is
appropriate when $\alpha = .05$ (since we don’t know whether the P-value is smaller or larger than this).
However, statistical software shows that the area to the right of 5.82 under the $F_{4,6}$ curve is .029, so the
P-value is 2(.029) = .058 and the null hypothesis should therefore not be rejected at level .05. More
generally, good statistical software will provide an exact P-value for any test based on an
F distribution.
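As a rough illustration of what such software does, the sketch below computes an F upper-tail area in plain Python. The function name `f_upper_tail` is hypothetical. It relies on the standard identity $P(F > f) = I_x(\nu_2/2, \nu_1/2)$ with $x = \nu_2/(\nu_2 + \nu_1 f)$, where $I_x$ is the regularized incomplete beta function, and it evaluates the integral by Simpson's rule under the assumption that both df are at least 2. In practice a statistics package would be used instead.

```python
import math


def f_upper_tail(f, df1, df2, steps=2000):
    """Approximate P(F > f) for an F distribution with (df1, df2) df.

    Assumes f >= 0, df1 >= 2, df2 >= 2, and an even number of steps.
    """
    a, b = df2 / 2.0, df1 / 2.0          # beta parameters for the upper tail
    x = df2 / (df2 + df1 * f)            # upper limit of the beta integral
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def beta_density(t):
        return math.exp(-log_beta) * t ** (a - 1) * (1 - t) ** (b - 1)

    # Composite Simpson's rule on the interval [0, x].
    h = x / steps
    total = beta_density(0.0) + beta_density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_density(i * h)
    return total * h / 3


one_tail = f_upper_tail(5.82, 4, 6)      # area to the right of 5.82
two_tail = 2 * one_tail                  # two-tailed P-value
print(round(one_tail, 3), round(two_tail, 3))   # prints 0.029 0.058
```

The result agrees with the software value quoted above, so the two-tailed P-value of about .058 falls just above .05.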


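A Confidence Interval for $\sigma_1/\sigma_2$
The CI for $\sigma_1^2/\sigma_2^2$ is based on the probability statement implied by (10.8):

$$P\left(F_{1-\alpha/2,\nu_1,\nu_2} < \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} < F_{\alpha/2,\nu_1,\nu_2}\right) = 1 - \alpha$$

Manipulating the inequalities to isolate $\sigma_1^2/\sigma_2^2$ yields

$$\frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{1-\alpha/2,\nu_1,\nu_2}} = \frac{S_1^2}{S_2^2}\cdot F_{\alpha/2,\nu_2,\nu_1}$$

Equation (10.9) has been used here to simplify the upper bound and enable use of Table A.8. Thus
the confidence interval for $\sigma_1^2/\sigma_2^2$ is

$$\left(\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2,m-1,n-1}},\ \frac{s_1^2}{s_2^2}\cdot F_{\alpha/2,n-1,m-1}\right)$$

An interval for $\sigma_1/\sigma_2$ results from taking the square root of each limit:

$$\left(\frac{s_1}{s_2}\cdot\frac{1}{\sqrt{F_{\alpha/2,m-1,n-1}}},\ \frac{s_1}{s_2}\cdot\sqrt{F_{\alpha/2,n-1,m-1}}\right)$$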

In the interval for the ratio of population standard deviations, notice that the limits of the interval are 
proportional to the ratio of sample standard deviations. Of course, the lower limit is less than the ratio 
of sample standard deviations, and the upper limit exceeds it. 


Example 10.16 Let’s calculate a confidence interval using the data of Example 10.15. The sample
standard deviations are $s_1 = 6.765$ for 6 Bruegger’s bagels and $s_2 = 2.338$ for 8 Wolferman English
muffins. Then a 98% confidence interval for the ratio $\sigma_1/\sigma_2$ is

$$\left(\frac{6.765}{2.338}\cdot\frac{1}{\sqrt{7.46}},\ \frac{6.765}{2.338}\cdot\sqrt{10.46}\right) = \left(2.89\cdot\frac{1}{\sqrt{7.46}},\ 2.89\cdot\sqrt{10.46}\right) = (1.06, 9.35)$$

Because 1 is not included in the interval, it suggests that the two standard deviations differ. By comparing
the CI calculation with the hypothesis test calculation, it should be clear that a two-tailed test would reject
equality at the 2% level, and this is consistent with the results of Example 10.15. ■

It is important to emphasize that the methods of this section are strongly dependent on the
normality assumption. Expression (10.8) is valid only in the case of normal data or nearly normal
data. Otherwise, the F distribution in (10.8) does not apply. The t procedures of this chapter are robust
to the normality assumption, meaning that the procedures still work in the case of moderate departures
from normality, but this is not true for comparison of standard deviations based on (10.8).

For nonnormal data, alternative tests for equal variance are available, including Levene’s test and 
an extension of the method by Bonett mentioned in Section 8.4. Consult your local statistician for 
more information. 
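The arithmetic of Example 10.16 can be checked with a few lines of Python. This is a sketch only: the variable names are illustrative, and the two critical values $F_{.01,5,7} = 7.46$ and $F_{.01,7,5} = 10.46$ are typed in from an F table rather than computed.

```python
import math

s1, s2 = 6.765, 2.338      # sample standard deviations from Example 10.16
f_crit_lower = 7.46        # F_{.01,5,7}, used for the lower limit
f_crit_upper = 10.46       # F_{.01,7,5}, used for the upper limit (table value)

ratio = s1 / s2
ci = (ratio / math.sqrt(f_crit_lower), ratio * math.sqrt(f_crit_upper))
print(tuple(round(limit, 2) for limit in ci))
```

Carried out without intermediate rounding, the limits come out near 1.06 and 9.36; the 9.35 in the example reflects rounding the ratio to 2.89 before multiplying. Either way the interval excludes 1.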


Exercises: Section 10.5 (71-78) 


71. Obtain or compute the following quantities:
a. $F_{.05,5,8}$
b. $F_{.05,8,5}$
c. $F_{.95,5,8}$
d. $F_{.95,8,5}$
e. The 99th percentile of the F distribution with $\nu_1 = 10$, $\nu_2 = 12$
f. The 1st percentile of the F distribution with $\nu_1 = 10$, $\nu_2 = 12$
g. $P(F \le 6.16)$ for $\nu_1 = 6$, $\nu_2 = 4$
h. $P(.177 \le F \le 4.74)$ for $\nu_1 = 10$, $\nu_2 = 5$

72. Give as much information as you can about the P-value of the F test in each of the
following situations:
a. (illegible in source)
b. (illegible in source)
c. $\nu_1 = 5$, $\nu_2 = 10$, two-tailed test, f = 5.64
d. $\nu_1 = 5$, $\nu_2 = 10$, lower-tailed test, f = .200
e. $\nu_1 = 35$, $\nu_2 = 20$, upper-tailed test, f = 3.24

73. Refer to Exercise 41. Does the data suggest that the standard deviation of the strength
distribution for fused specimens is smaller than that for not-fused specimens? Carry
out a test at significance level .01 by obtaining as much information as you can
about the P-value.

74. Return to the data on maximum lean angle given in Exercise 29. Carry out a test at
significance level .10 to see whether the population standard deviations for the two
age groups are different (normal probability plots support the necessary normality
assumption).

75. Refer to the railway repair time data in Exercise 16. Carry out a test at significance
level .01 to see whether the population standard deviation for time-to-repair is larger
for high rail breaks than for low rail breaks.

76. Exercise 35 presented data on the pour size of two groups of experienced bartenders,
one group pouring into tumblers and the other into highball glasses. Test the
hypothesis that the variability in the two (conceptual) populations of pour sizes is
different, at the $\alpha = .02$ level.

77. Return to the fat loss experiment described in Exercise 24. Calculate a 95% CI for the
ratio of the population standard deviations for the experimental and control groups.

78. For the data of Exercise 29 find a 90% confidence interval for the ratio of population
standard deviations, and relate your CI to the test of Exercise 74.


10.6 Inferences Using the Bootstrap and Permutation Methods


In this chapter we have discussed how to make comparisons based on normal data. We have also 
considered comparisons of means when the sample sizes are large enough for t procedures to apply
even in the absence of normality. What about other cases (e.g., smaller, skewed data sets), for which 
such methods are inappropriate? We now consider computer-intensive “resampling” techniques for 
confidence intervals and hypothesis tests that can be applied to many comparison situations. 


The Two-Sample Bootstrap CI

The bootstrap for two samples is similar to the one-sample bootstrap of Section 8.5, except that 
samples with replacement are taken from the two original samples separately. That is, a resample is 
taken from the first group, a separate resample is taken from the second group, and then the difference 
of means (or some other comparison statistic) is computed. This process is repeated a large number of 
times, resulting in the bootstrap distribution of the comparison statistic. 

If the bootstrap distribution appears normal, then a bootstrap t confidence interval can be computed
in a similar manner to the one presented in Section 8.5, with $s_{boot}$ replacing the standard error
expression from Section 10.2. If the statistic being bootstrapped is $\bar{X} - \bar{Y}$, it is common to use a
conservative t critical value with df = min(m - 1, n - 1); Welch’s df formula (10.3) is also sometimes
used in practice, partly to agree with the classic two-sample t interval for $\mu_1 - \mu_2$. (On the other hand,
if we are bootstrapping a difference of medians or trimmed means there is no concern about
agreement with a t interval.) Another reasonable alternative is to use a z critical value (df = $\infty$; the
software package Stata does this).

If the bootstrap distribution does not look normal, then the percentile interval should be calculated,
just as was done in Section 8.5. A CI with confidence level approximately 95% requires determining
the 2.5 and 97.5 percentiles of the bootstrap distribution. The bias-corrected and accelerated (BCa)
interval is a further refinement available in some software packages, including R and Stata. Once a valid
$100(1 - \alpha)\%$ CI has been calculated, the hypothesis $H_0$: $\mu_1 - \mu_2 = \Delta_0$ is rejected at significance level $\alpha$ in
favor of the two-sided (i.e., $\ne$) alternative if and only if the CI does not include $\Delta_0$.


Example 10.17 As an example of the bootstrap for two samples, consider data from a study of 
children talking to themselves (private speech), introduced in Exercise 81 of Chapter 1. The children 
were each observed in many 10-second intervals (about 100) and the researchers computed the 
percentage of intervals in which private speech occurred. Because private speech tends to occur when 
there is a challenging task, the students were observed when they were doing arithmetic. The private 
speech is classified as on task if it is about arithmetic, off task if it is about something else, and 
mumbling if the subject is not clear. 
Here we consider just the off-task percentages for the 18 male and 15 female first graders: 


B: 4.9, 5.5, 6.5, 0.0, 0.0, 3.0, 2.8, 6.4, 1.0, 0.9, 0.0, 28.1, 8.7, 1.6, 5.1, 17.0, 4.7, 28.1 
G: 0.0, 1.3, 2.2, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 10.1, 5.2, 3.2, 0.0 

The two-sample t interval of Section 10.2 should not be applied: the sample sizes are rather small, and
with the large number of zeroes (a majority for the girls), the population normality assumption is clearly
violated. Nevertheless, it is useful to give that CI purely for comparison purposes. The 95% interval is

$$\bar{x} - \bar{y} \pm t_{.025,\nu}\sqrt{\frac{s_1^2}{18} + \frac{s_2^2}{15}} = 6.906 - 1.813 \pm 2.080\sqrt{\frac{8.7197^2}{18} + \frac{2.846^2}{15}}$$

$$= 5.093 \pm 2.080(2.1825) = 5.093 \pm 4.540 = (.55, 9.63)$$

Welch's formula was used to obtain $\nu = 21$.

Again, the t method is of questionable validity, because the sample sizes might not be large enough
to compensate for the nonnormality. The bootstrap method involves drawing a random resample of
size 18 with replacement from the 18 boys, drawing a random resample of size 15 with replacement
from the 15 girls, and calculating the difference of resampled means $\bar{x}^* - \bar{y}^*$. This process is repeated
a large number of times, creating a bootstrap distribution for the statistic $\bar{X} - \bar{Y}$. Here are random
resamples from the boys and girls:

B: 0.0, 3.0, 2.8, 0.9, 3.0, 0.0, 0.0, 6.5, 6.4, 8.7, 6.4, 1.0, 0.9, 5.5, 17.0, 17.0, 0.0, 3.0 
G: 1.3, 0.0, 0.0, 0.0, 0.0, 1.3, 1.3, 0.0, 3.2, 0.0, 1.3, 5.2, 0.0, 0.0, 0.0 

For these two resamples, the difference of means is $\bar{x}^* - \bar{y}^* = 4.56 - .91 = 3.65$. Doing this
1000 times (using the R package boot) gives the bootstrap distribution displayed in Figure 10.12.

The bootstrap distribution looks almost normal, but with some positive skewness. If the original
sample of boys and girls is representative of their populations, then the histogram in Figure 10.12
should resemble the true sampling distribution of $\bar{X} - \bar{Y}$ in this scenario. For example, the standard
deviation of the bootstrap distribution (i.e., of the 1000 $\bar{x}^* - \bar{y}^*$ values) is $s_{boot} = 2.1874$, very close to
the 2.1825 that was computed for the estimated standard error in the two-sample t interval above.


[Figure: histogram (left) and normal probability plot (right) of the 1000 bootstrap differences of means]

Figure 10.12 Histogram and normal plot of the bootstrapped difference in means from R


In the presence of a not-quite-normal bootstrap distribution, we use the percentile interval. The
confidence limits for a 95% confidence interval are the 2.5 percentile and the 97.5 percentile of the
$\bar{x}^* - \bar{y}^*$ distribution. When the 1000 bootstrap differences of means were sorted, the 25th value from
the bottom was 1.029 and the 25th value from the top was 9.760. This gives a 95% CI of (1.029,
9.760). The skewness of the bootstrap distribution pushes the endpoints a little to the right of the
endpoints of the two-sample t interval. In addition, one can use software to compute the BCa
refinement, as discussed in Section 8.5. The improved interval (1.625, 10.446), obtained from R, is
moved even farther to the right compared to the previous intervals. This last interval is the most
trustworthy.

Neither of these bootstrap intervals includes 0, implying that the hypothesis $\mu_1 - \mu_2 = 0$ should be
rejected at the .05 significance level in favor of the conclusion that the two population means are
different. ■
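The two-sample percentile bootstrap of Example 10.17 can be sketched in plain Python. This is a minimal illustration, not the R `boot` computation used in the example: the function names, the seed, and the use of Python's `random` module are all assumptions, so the endpoints will differ somewhat from (1.029, 9.760). No BCa refinement is attempted here.

```python
import random
import statistics

# Off-task percentages from Example 10.17.
boys = [4.9, 5.5, 6.5, 0.0, 0.0, 3.0, 2.8, 6.4, 1.0, 0.9, 0.0, 28.1,
        8.7, 1.6, 5.1, 17.0, 4.7, 28.1]
girls = [0.0, 1.3, 2.2, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 10.1,
         5.2, 3.2, 0.0]


def bootstrap_diff_means(x, y, n_boot=1000, seed=1):
    """Sorted bootstrap differences of means; each group is resampled separately."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        xs = rng.choices(x, k=len(x))   # resample of the first group, with replacement
        ys = rng.choices(y, k=len(y))   # separate resample of the second group
        diffs.append(statistics.mean(xs) - statistics.mean(ys))
    return sorted(diffs)


def percentile_interval(sorted_vals, conf=0.95):
    """Percentile interval: k-th value from the bottom and k-th from the top."""
    k = max(1, round(len(sorted_vals) * (1 - conf) / 2))   # 25 when n_boot = 1000
    return sorted_vals[k - 1], sorted_vals[-k]


diffs = bootstrap_diff_means(boys, girls)
lo, hi = percentile_interval(diffs)
s_boot = statistics.stdev(diffs)        # bootstrap standard error of the difference
print(round(lo, 3), round(hi, 3), round(s_boot, 3))
```

With 1000 resamples the endpoints still vary by a few tenths from one seed to another, which is one reason to use a larger number of resamples when computing time permits.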


Permutation Tests 

Permutation tests provide a template for comparing two (or more) populations without requiring large 
samples or assuming any specific distribution for the data. The relevant null hypothesis is that the two 
population distributions are identical (which implies both equal means and equal standard deviations). 
The idea behind such tests is that under the null hypothesis, every observation comes from the same 
distribution, and so the group labels (e.g., population 1 vs population 2 or treatment vs control) are 
meaningless. If that’s true, then we can permute—that is, scramble or rearrange—the group labels 
without changing the group population means. We look at all possible label arrangements (or at least 
a large number), compute the difference of means for each of these, and compute a P-value by seeing 
how extreme is our original difference of means. 


Example 10.18 The article “Comparison of Platelet Function and Viscoelastic Test Results between 
Healthy Dogs and Dogs with Naturally Occurring Chronic Kidney Disease” (Amer. J. Veterinary Res. 
2017: 589-600), first presented in Example 1.18, provides data on the fibrinogen levels (mg/dl of 
blood) for two samples of dogs: 11 with chronic kidney disease (CKD) and 10 with normal kidney 
function. Researchers were concerned that CKD increases the production of fibrinogen, which can 
lead to excessive blood clotting. Boxplots of both samples show considerable skewness and the 
sample sizes are small, so a two-sample t test would be of questionable validity.

In order to demonstrate the permutation test method, consider an even smaller-scale version of this 
data: measurements 315, 290, 275 for the CKD dogs (m = 3) and 313, 250 for the healthy dogs 
(n = 2). Under the null hypothesis of equal population distributions, it should not matter if we reassign 
the labels “CKD” and “healthy.” Therefore, we consider all ways of selecting three from among the 
five observations to be the CKD sample, leaving the other two for the healthy sample. Under $H_0$, the
ten choices listed in Table 10.4 are equally likely.


Table 10.4 All possible rearrangements of m = 3 and n = 2 observations

CKD dogs        $\bar{x}$      Healthy dogs    $\bar{y}$      $\bar{x} - \bar{y}$
315 290 275     293.3    313 250         281.5     11.8
315 290 313     306.0    275 250         262.5     43.5
315 290 250     285.0    313 275         294.0     -9.0
315 275 313     301.0    290 250         270.0     31.0
315 275 250     280.0    290 313         301.5    -21.5
315 313 250     292.7    290 275         282.5     10.2
290 275 313     292.7    315 250         282.5     10.2
290 275 250     271.7    315 313         314.0    -42.3
290 313 250     284.3    275 315         295.0    -10.7
275 313 250     279.3    290 315         302.5    -23.2

How extreme is our original difference of means, 293.3 - 281.5 = 11.8 (the top row of Table 10.4),
in this set of ten differences? Because it is the third-largest of the ten $\bar{x} - \bar{y}$ values, our P-value for
a one-sided test is 3/10 = .3, the fraction of arrangements that give a difference at least as large as our
original difference. ■
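The enumeration in Table 10.4 can be reproduced with `itertools.combinations`. The sketch below is illustrative only; the function name `exact_permutation_pvalue` is hypothetical, and it handles just the upper-tailed difference of means used in Example 10.18.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(x, y):
    """Upper-tailed exact permutation P-value for mean(x) - mean(y).

    Returns the P-value and the list of differences for every arrangement.
    """
    pooled = list(x) + list(y)
    m = len(x)
    observed = mean(x) - mean(y)
    diffs = []
    # Every way of choosing which m pooled observations carry the first label.
    for idx in combinations(range(len(pooled)), m):
        chosen = set(idx)
        group1 = [pooled[i] for i in idx]
        group2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diffs.append(mean(group1) - mean(group2))
    # Small tolerance so the original arrangement is counted despite rounding.
    extreme = sum(d >= observed - 1e-9 for d in diffs)
    return extreme / len(diffs), diffs


p_value, diffs = exact_permutation_pvalue([315, 290, 275], [313, 250])
print(len(diffs), p_value)   # prints 10 0.3
```

Choosing index positions rather than values keeps arrangements distinct even when the pooled data contain ties.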


When m = 3 and n = 2, it is simple enough to deal with all $\binom{5}{3} = 10$ arrangements. What happens
when we try to use the whole set of 21 dogs?


Example 10.19 (Example 10.18 continued) Now consider a permutation test for the full dog health 
data: 


CKD dogs: 183, 190, 250, 275, 290, 315, 320, 330, 410, 500, 821 (m = 11, $\bar{x} = 353.1$)
Healthy dogs: 99, 160, 165, 170, 178, 181, 190, 201, 250, 313 (n = 10, $\bar{y} = 190.7$)


Here we are dealing with $\binom{21}{11} = 352{,}716$ possible permutations of the 11 CKD dogs and 10
healthy dogs. Even on a reasonably fast computer it might take a while to generate this many
differences and see how many are at least as large as the value $\bar{x} - \bar{y} = 353.1 - 190.7 = 162.4$ from
the original data. Instead, we can take a random sample of all possible arrangements and get quite
close to the exact answer. Figure 10.13 shows a histogram of 2000 values of $\bar{x} - \bar{y}$ created by
randomly permuting the group labels; though this is short of all possible arrangements, this should
give us a reasonable estimate of the P-value. Of the 2000 label permutations, only two resulted in an
$\bar{x} - \bar{y}$ value of 162.4 or higher, for an estimated P-value of 2/2000 = .001. Thus, we have convincing
statistical evidence to conclude that dogs with chronic kidney disease do have higher blood fibrinogen
levels, on average, than healthy dogs.
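The random-relabeling approach of Example 10.19 might be coded as follows. This is a sketch under stated assumptions: the function name, the seed, and the use of `random.shuffle` are illustrative choices, so the estimated P-value will not match the 2/2000 reported above exactly, although it should be of the same order.

```python
import random
from statistics import mean

# Fibrinogen levels (mg/dl) from Example 10.19.
ckd = [183, 190, 250, 275, 290, 315, 320, 330, 410, 500, 821]
healthy = [99, 160, 165, 170, 178, 181, 190, 201, 250, 313]


def approx_permutation_pvalue(x, y, n_perm=2000, seed=1):
    """Estimate the upper-tailed permutation P-value for mean(x) - mean(y)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    observed = mean(x) - mean(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # random reassignment of the group labels
        diff = mean(pooled[:m]) - mean(pooled[m:])
        if diff >= observed - 1e-9:
            count += 1
    return count / n_perm


p_hat = approx_permutation_pvalue(ckd, healthy)
print(p_hat)
```

Because each shuffle is drawn independently, the same arrangement can occur more than once; with hundreds of thousands of possible arrangements this has a negligible effect on the estimate.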


[Figure: randomization test histogram for CKD, Healthy; $H_0$: $\mu_1 - \mu_2 = 0$, $H_1$: $\mu_1 - \mu_2 > 0$]

Figure 10.13 Permutation distribution for Example 10.19 ■


The method shown in Example 10.19 could equally be applied to the difference of two sample
medians or any other comparison statistic. The general permutation test method is summarized in the
accompanying box.

Permutation Tests

Let $\theta_1$ and $\theta_2$ be the same parameters (means, medians, standard deviations,
etc.) for two different populations, and consider testing $H_0$: $\theta_1 = \theta_2$ based on
independent samples of sizes m and n, respectively. Suppose that when $H_0$ is
true, the two population distributions are identical in all respects, so all
m + n observations have actually been selected from the same population
distribution. In this case, the labels 1 and 2 are arbitrary, as any m of the
m + n observations have the same chance of ending up in the first sample
(leaving the remaining n for the second sample).

An exact permutation test computes a suitable comparison statistic for all 
possible rearrangements and sets the P-value equal to the fraction of these that 
are at least as extreme as the statistic computed on the original samples. This is 
the P-value for a one-tailed test, and it needs to be doubled for a two-tailed test. 

For an approximate permutation test, instead of all possible arrangements, 
we take a random sample with replacement from the set of all possible 
arrangements. 


Permutation tests do not assume a specific underlying distribution, such as the normal distribution. 
However, this does not mean that there are no assumptions whatsoever. The null hypothesis in a 
permutation test is that the two distributions are the same, and any deviation can increase the
probability of rejecting the null hypothesis. Thus, strictly speaking, we are doing a test for equal means only
if the distributions are alike in all other respects, including shape and variability. See Exercise 94 for a 
(pathological) example in which the permutation test underestimates the true P-value. 


Inferences Based on Other Statistics 

The bootstrap and permutation methods are not limited to comparing means. In any of the previous 
examples, we could have considered the difference of two medians or two trimmed means instead. 
Likewise, these methods can be employed for inferences concerning the variability of two populations. 
Section 10.5 discussed the use of the F distribution for comparing two variances, but this inferential 
method is strongly dependent on normality. Bootstrapping does not require this assumption. 


Example 10.20 Consider the off-task private speech data from Example 10.17. The sample standard 
deviations for boys and girls are 8.72 and 2.85, respectively. The method of Section 10.5 gives for the 
ratio of male to female variances the 95% confidence interval 


$$\left(\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{.025,17,14}},\ \frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{.975,17,14}}\right) = \left(\frac{8.72^2}{2.85^2}\cdot\frac{1}{2.900},\ \frac{8.72^2}{2.85^2}\cdot\frac{1}{F_{.975,17,14}}\right) = (3.23, 25.77)$$

Taking the square root gives (1.80, 5.08) as the 95% confidence interval for the ratio of standard
deviations. However, the legitimacy of this interval is seriously in question because of the skewed
distributions.

Let’s apply the bootstrap method to this problem. Take random resamples of 18 boys and 15 girls,
calculate the standard deviations $s_1^*$ and $s_2^*$ of the two resamples, and then compute their ratio $s_1^*/s_2^*$.
One such pair of resamples returned $s_1^* = 5.264$ for the boys and $s_2^* = 1.505$ for the girls, for a ratio
of 5.264/1.505 = 3.498. This process was repeated 1000 times using the boot package in R.

The resulting bootstrap distribution (not shown) is strongly skewed to the right, so a percentile
interval is required. The 2.5 percentile is 1.013 and the 97.5 percentile is 7.888, so the 95% confidence
interval for the population ratio of standard deviations is (1.013, 7.888). The BCa refinement
gives the interval (0.885, 7.382).

These two intervals differ in an important respect: the percentile interval excludes 1 but the BCa
refinement includes 1. In other words, the BCa interval allows the possibility that the two population
standard deviations are the same, but the percentile interval does not. We expect the BCa method to
be an improvement, and this is verified in the next example, where we see that the BCa result is
consistent with the results of a permutation test. ■
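A percentile bootstrap for the ratio of standard deviations, as in Example 10.20, might look like the sketch below. It is an illustration only: the function name and seed are assumptions, no BCa adjustment is attempted, and a resample of girls consisting entirely of zeros (possible, since 8 of the 15 values are 0.0) is recorded as an infinite ratio rather than causing a division by zero.

```python
import math
import random
from statistics import stdev

# Off-task percentages from Example 10.17.
boys = [4.9, 5.5, 6.5, 0.0, 0.0, 3.0, 2.8, 6.4, 1.0, 0.9, 0.0, 28.1,
        8.7, 1.6, 5.1, 17.0, 4.7, 28.1]
girls = [0.0, 1.3, 2.2, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 10.1,
         5.2, 3.2, 0.0]


def bootstrap_sd_ratio(x, y, n_boot=1000, seed=1):
    """Sorted bootstrap values of stdev(x*) / stdev(y*)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        s1 = stdev(rng.choices(x, k=len(x)))
        s2 = stdev(rng.choices(y, k=len(y)))
        # Guard against a degenerate resample with zero spread.
        ratios.append(s1 / s2 if s2 > 0 else math.inf)
    return sorted(ratios)


ratios = bootstrap_sd_ratio(boys, girls)
k = round(0.025 * len(ratios))            # 25 when n_boot = 1000
lo, hi = ratios[k - 1], ratios[-k]        # 95% percentile interval
print(round(lo, 3), round(hi, 3))
```

The upper endpoint sits in a long right tail, so it is noticeably less stable from run to run than the lower endpoint.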


Next, consider testing $H_0$: $\sigma_1 = \sigma_2$. Again, the traditional F test of Section 10.5 requires data from
normal populations and is not robust to violations of that assumption, so its validity is questionable 
for nonnormal data (even when the sample sizes are large). Instead, we might use a permutation test. 

It must be re-emphasized that the permutation test assumes identical distributions under $H_0$, not just
identical sd’s. So, for instance, data from two populations with very different shapes or means but the
same variability would likely result in a low P-value from the permutation test, even if the statement
$\sigma_1 = \sigma_2$ is true. In fairness, the F test also assumes identical shapes (the normal curve), though not
necessarily the same means. A graphical exploration of the data may illuminate the nature of the 
differences between two groups if a permutation test rejects its null hypothesis. 


Example 10.21 (Example 10.20 continued) We know that the ratio of sample standard deviations 
for off-task private speech, males versus females, is 8.72/2.85 = 3.064. The idea of the permutation 
test is to find out how unusual this value is if we blur the distinction between males and females. That 
is, we remove the labels from the 18 males and 15 females and then consider all possible choices of 
18 from the 33 children. For each of these possible choices we find the ratio of the standard deviation 
of the first 18 to the standard deviation of the last 15. The one-tailed P-value is the fraction that is at 
least as big as the original ratio value of 3.064. 

Because there are more than a billion possible choices of 18 from 33, we instead selected 5000 
random choices. Of these, 432 were at least as large as 3.064, so the one-tailed P-value is 
432/5000 = .0864. For a two-tailed P-value we double this and get .1728. The permutation test does 
not reject $H_0$: $\sigma_1 = \sigma_2$ in favor of $H_a$: $\sigma_1 \ne \sigma_2$ at the 5% level (or even the 10% level).

How does the permutation test result compare with the previous results? Recall that the F interval and 
the “unadjusted” percentile interval ruled out the possibility that the two standard deviations are the 
same, but the BCa refinement disagreed, because 1 was included in the BCa interval. Taking it for 
granted that the permutation test is a valid approach and the permutation test does not reject the equality 
of standard deviations, the BCa interval is the only one of the three CIs consistent with this result. ■
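The permutation test of Example 10.21 can be sketched in the same style. The function name, seed, and shuffling scheme below are assumptions made for illustration, so the estimate will differ slightly from 432/5000 = .0864.

```python
import random
from statistics import stdev

# Off-task percentages from Example 10.17.
boys = [4.9, 5.5, 6.5, 0.0, 0.0, 3.0, 2.8, 6.4, 1.0, 0.9, 0.0, 28.1,
        8.7, 1.6, 5.1, 17.0, 4.7, 28.1]
girls = [0.0, 1.3, 2.2, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 10.1,
         5.2, 3.2, 0.0]


def permutation_sd_ratio_pvalue(x, y, n_perm=5000, seed=1):
    """Estimate the upper-tailed permutation P-value for stdev(x) / stdev(y)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    observed = stdev(x) / stdev(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                          # scramble the labels
        ratio = stdev(pooled[:m]) / stdev(pooled[m:])
        if ratio >= observed - 1e-9:
            count += 1
    return count / n_perm


one_tailed = permutation_sd_ratio_pvalue(boys, girls)
two_tailed = 2 * one_tailed          # doubled, as described in the example
print(one_tailed, two_tailed)
```

Only 11 of the 33 pooled values are zero, so no relabeled group of 15 can have zero standard deviation and the ratio is always defined.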


The Analysis of Paired Data 
The bootstrap can be used for paired data if we work with the paired differences, as in the paired 
t methods of Section 10.3. 


Example 10.22 Consider once again the private speech study from Example 10.17. The study 
included the percentage of intervals with on-task private speech for 33 children in the first, second, 
and third grades. Here we will consider just the 15 girls’ scores in first and second grade. Is there a 
change in on-task private speech when the girls go from the first to the second grade? Here are the 
percentages of intervals in which on-task private speech occurred, and also the differences. 


Grade 1    Grade 2    Difference
25.7       18.6          7.1
36.0       17.4         18.6
27.6        2.6         25.0
29.7        0.9         28.8
36.0        1.5         34.5
35.1       14.1         21.0
42.0        3.3         38.7
 7.6        1.6          6.0
14.1        0.0         14.1
25.0        1.5         23.5
20.2        0.0         20.2
24.4        2.1         22.3
10.4       18.4         -8.0
21.1        2.6         18.5
 5.6       26.0        -20.4


Figure 10.14 shows a histogram for the differences; there is a pronounced negative skew.

Figure 10.14 Histogram of differences for girls from Stata


The paired t method of Section 10.3 requires normality, so the skewness might invalidate this, but
we present results here anyway for comparison purposes. The 95% confidence interval for the
population mean difference is

$$\bar{d} \pm t_{.025,15-1}\frac{s_D}{\sqrt{n}} = 16.66 \pm 2.145\frac{15.43}{\sqrt{15}} = 16.66 \pm 8.54 = (8.12, 25.20)$$


The bootstrap focuses on the 15 differences and uses the method of Section 8.5. Using Stata, we
drew 1000 resamples of size 15 with replacement from the 15 differences; the 1000 resample means
$\bar{d}_1^*, \ldots, \bar{d}_{1000}^*$ constitute the bootstrap distribution. Figure 10.15 shows a histogram of these mean
differences.

The histogram of $\bar{d}^*$ values is negatively skewed, which is expected because of the negative
skewness shown in Figure 10.14 for the original sample. The 95% percentile interval has the 2.5th
percentile of the bootstrap distribution as its lower limit and the 97.5th percentile as its upper limit:
(7.91, 23.97). This interval is to the left of the paired t interval because of the negative skewness of
the bootstrap distribution. The BCa refinement from Stata yields the interval (6.43, 23.12), which is
even farther to the left.

Figure 10.15 Histogram of bootstrap differences for girls from Stata


All of the intervals agree that there is a substantial population difference between first grade and 
second grade. There is a strong reduction in the on-task private speech of girls between first and 
second grades. ■
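The paired bootstrap of Example 10.22 resamples the 15 differences as a single sample. The following sketch uses hypothetical names and an arbitrary seed, and it relies on Python's `random` module rather than Stata, so its percentile interval will be close to, but not identical with, (7.91, 23.97).

```python
import random
from statistics import mean

# Grade 1 minus Grade 2 differences from Example 10.22.
differences = [7.1, 18.6, 25.0, 28.8, 34.5, 21.0, 38.7, 6.0, 14.1, 23.5,
               20.2, 22.3, -8.0, 18.5, -20.4]


def bootstrap_means(values, n_boot=1000, seed=1):
    """Sorted means of resamples drawn with replacement from the differences."""
    rng = random.Random(seed)
    n = len(values)
    return sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))


boot_means = bootstrap_means(differences)
k = round(0.025 * len(boot_means))              # 25 when n_boot = 1000
lo, hi = boot_means[k - 1], boot_means[-k]      # 95% percentile interval
print(round(mean(differences), 2), round(lo, 2), round(hi, 2))
```

Resampling whole differences keeps each girl's two scores together, which is what distinguishes this from the two-sample bootstrap.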


A permutation test for paired data involves permutations within the pairs. Under the null
hypothesis of identical population distributions, the two observations in a pair have the same
population mean, so the population mean difference is zero even if the order is reversed. Therefore, we
consider all possible orderings of the n pairs. Because there are two possible orderings within each
pair, there are $2^n$ arrangements of n pairs. The one-tailed P-value is the fraction of the $2^n$ differences
that are at least as extreme as the observed value, and the two-tailed P-value is double this.


Example 10.23 To see how the permutation test works for paired data, first consider a scaled-down
version of the data from Example 10.22 with only the first three pairs: (25.7, 18.6), (36.0, 17.4), and
(27.6, 2.6). They give a mean difference of (7.1 + 18.6 + 25.0)/3 = 16.9. Here are all $8 = 2^3$
permutations with the corresponding mean differences.

Arrangements                                  Mean difference
(25.7, 18.6)  (36.0, 17.4)  (27.6, 2.6)        16.90
(25.7, 18.6)  (36.0, 17.4)  (2.6, 27.6)          .23
(25.7, 18.6)  (17.4, 36.0)  (27.6, 2.6)         4.50
(25.7, 18.6)  (17.4, 36.0)  (2.6, 27.6)       -12.17
(18.6, 25.7)  (36.0, 17.4)  (27.6, 2.6)        12.17
(18.6, 25.7)  (36.0, 17.4)  (2.6, 27.6)        -4.50
(18.6, 25.7)  (17.4, 36.0)  (27.6, 2.6)         -.23
(18.6, 25.7)  (17.4, 36.0)  (2.6, 27.6)       -16.90

Because the mean difference for the original sample is the highest value of eight, the one-tailed
P-value is 1/8 = .125, and the two-tailed P-value is 2(1/8) = .25.

Next, let’s apply the permutation test to the paired data for all 15 girls of Example 10.22. In
principle it is no harder to deal with the $2^n = 2^{15} = 32{,}768$ arrangements when all 15 pairs are
included, but this exact approach is generally approximated using a random sample. We used Stata to
draw an additional 4999 samples. Of the 4999, none yielded a mean difference as large as the value
$\bar{d} = 16.66$ obtained for the original sample of 15 differences. Therefore, the one-tailed P-value is
1/5000 = .0002, and the two-tailed P-value is 2(.0002) = .0004. Rejection of the null hypothesis at
the 5% level was to be expected, given that none of the confidence intervals in Example 10.22
included 0.

It is interesting to compare the permutation test result with the paired t test of Section 10.3. For
testing the null hypothesis of 0 population mean difference, the value of t is

$$t = \frac{\bar{d} - 0}{s_D/\sqrt{15}} = \frac{16.66}{15.425/\sqrt{15}} = 4.18$$

The two-tailed P-value for this is .0009, not very different from the result of the permutation test. ■
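Swapping the order within a pair simply changes the sign of that pair's difference, so the exact paired permutation test of Example 10.23 can be sketched by enumerating sign patterns. The function name below is hypothetical, and the code is an illustration of the idea rather than the Stata procedure used in the example.

```python
from itertools import product


def paired_permutation_pvalue(diffs):
    """Exact upper-tailed permutation P-value for the mean paired difference."""
    n = len(diffs)
    observed = sum(diffs) / n
    count = total = 0
    # Each pair keeps (+1) or reverses (-1) its order: 2**n arrangements in all.
    for signs in product((1, -1), repeat=n):
        total += 1
        permuted_mean = sum(s * d for s, d in zip(signs, diffs)) / n
        if permuted_mean >= observed - 1e-9:   # tolerance for rounding error
            count += 1
    return count / total


# The first three differences from Example 10.22.
p_small = paired_permutation_pvalue([7.1, 18.6, 25.0])
print(p_small)   # prints 0.125
```

The same function can be applied to all 15 differences, in which case it enumerates the full set of 32,768 arrangements instead of sampling from them.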


Exercises: Section 10.6 (79-94)

79. A student project by Heather Kral studied students on “lifestyle floors” of a dormitory
in comparison to students on other floors. On a lifestyle floor the students share a
common major, and there are a faculty coordinator and resident assistant from that
department. Here are the GPAs of 30 students on lifestyle floors (L) and 30 students
on other floors (N):

L: 2.00 2.25 2.60 2.90 3.00 3.00 3.00 3.00
   …
   3.80 3.90 4.00 4.00 4.00 4.00
N: 1.20 2.00 2.29 2.45 2.50 2.50 2.50 2.50
   …
   2.86 2.90 3.00 3.07 3.10 3.25 3.50 3.54
   3.56 3.60 3.70 3.75 3.80 4.00

Notice that the lifestyle GPAs have a large number of repeats and the distribution is
skewed, so there is some question about normality.

a. Find a 95% confidence interval for the difference of population means
using the two-sample t interval.

b. Use software to generate a bootstrap sample of differences of means. Check
the bootstrap distribution for normality using a normal probability plot.

c. Use the standard deviation of the bootstrap distribution along with the mean
and t critical value from (a) to get a 95% confidence interval for the difference of
means.

d. Use the bootstrap sample and the percentile method to obtain a 95% confidence
interval for the difference of means.

e. Compare your three confidence intervals. If they are very similar, why do
you think this is the case?

f. Interpret your results. Is there a substantial difference between lifestyle and
other floors? Why do you think the difference is as big as it is?

80. For the data of the previous exercise, now consider testing the hypothesis of equal
population variances.

a. Carry out a two-tailed test using the method of Section 10.5. Recall that this
method requires the data to be normal, and the method is sensitive to departures
from normality. Check the data for normality to see if the F test is justified.

b. Carry out a two-tailed permutation test
for the hypothesis of equal population
variances (or standard deviations). Why
does it not matter whether you use
variances or standard deviations?

c. Compare the two results and summarize
your conclusions.

81. For the data of the previous two exercises,
we want a 95% confidence interval for the
ratio of population standard deviations.

a. Use the method of Section 10.5. Recall
that this method requires the data to be
normal, and the method is sensitive to
departures from normality. Check the
data for normality to see if the F distribution
can be used for the ratio of
sample variances.

b. Use software to generate a bootstrap
sample of ratios of standard deviations.
Then use the percentile method to
obtain a 95% confidence interval for the
ratio of population standard deviations.

c. Compare the two results and discuss the
relationship of the results to those of the
previous exercise.


In this application from major league base- 
ball, the populations represent an abstrac- 
tion of what the players can do, so the 
populations will vary from year to year. The 
Colorado Rockies and the Arizona Dia- 
mondbacks played nine games in Phoenix 
and ten games in Denver in 2001. The 
thinner air in Denver causes curve balls to 
curve less and it allows fly balls to travel 
farther. Does this mean that more runs are 
scored in Denver? The numbers of runs 
scored by the two teams in the nine Phoenix 
games (P) and ten Denver games (D) are 


P: 5.09 15.88 3 8.47 11.65
   6.48 11.65 TAIL 9.53

D: 10 18 15.56 19 8.1
   14 13.76 10 20.12 10.59


The fractions occur because the numbers 
have been adjusted for nine innings (54 
outs). For example, in the third Denver
game the Rockies won 10 to 7 on a home
run with two out in the bottom of the tenth
inning, so there were 59 outs instead of 54,
and the number of runs is adjusted to
(54/59)(17) = 15.56. We want to compare
the average runs in Denver with the average
runs in Phoenix.

a. Find a 95% confidence interval for the 
difference of population means using 
the two-sample t interval.

b. Use software to generate a bootstrap 
sample of differences of means. Check 
the bootstrap distribution for normality 
using a normal probability plot. 

c. Use the standard deviation of the boot- 
strap distribution along with the mean 
and t critical value from (a) to get a 95%
confidence interval for the difference of 
means. 

d. Use the bootstrap sample and the per- 
centile method to obtain a 95% confi- 
dence interval for the difference of 
means. 

e. Compare your three confidence inter- 
vals. If you used a standard normal 
critical value in place of the t critical
value in (c), why would that make this 
interval more like the one in (d)? Why 
should the three intervals be fairly 
similar for this data set? 

f. Interpret your results. Is there a sub- 
stantial difference between the two 
locations? Compare the difference with 
what you thought it would be. If you 
were a major league pitcher, would you 
want to be traded to the Rockies? 


83. For the data of the previous exercise we
want to compare population medians for 
the runs in Denver versus the runs in 
Phoenix. 


a. Use software to generate a bootstrap 
sample of differences of medians. Check 
the bootstrap distribution for normality 
using a normal probability plot. 

b. Use the standard deviation of the bootstrap
distribution along with the difference
of the medians in the original
sample and the t critical value from the
previous exercise to get a 95% confidence
interval for the difference of
population medians.


c. Use the bootstrap sample and the per- 
centile method to obtain a 95% confi- 
dence interval for the difference of 
population medians. 

d. Compare the two confidence intervals. 

e. How do the results for the median 
compare with the results for the mean? 
In terms of precision (measured by the 
width of the confidence interval) which 
gives the best results? 


84. Can the right diet help us cope with diseases
associated with aging such as Alzheimer’s
disease? A study (“Reversals of
Age-Related Declines in Neuronal Signal 
Transduction, Cognitive, and Motor 
Behavioral Deficits with Blueberry, Spi- 
nach, or Strawberry Dietary Supplement,” 
J. Neurosci. 1999: 8114-8121) investigated 
the effects of fruit and vegetable supple- 
ments in the diet of rats. The rats were 
19 months old, which is aged by rat stan- 
dards. The 40 rats were randomly assigned 
to four diets, of which we will consider just 
the blueberry diet and the control diet here. 
After 8 weeks on their diets, the rats were 
given a number of tests. We give the data 
for just one of the tests, which measured 
how many seconds they could walk on a 
rod. Here are the times for the ten control 
rats (C) and ten blueberry rats (B): 


C: 15.00 7.00 2.44 5.60 3.63
   6.24 4.12 8.21 3.90 0.95
B: 5.12 9.38 18.77 15.03 6.67
   7.91 7.38 15.09 11.57 8.98


The objective is to obtain a 95% confi- 
dence interval for the difference of popu- 
lation means. 


a. Determine a 95% confidence interval 
for the difference of population means 
using the two-sample t interval.

b. Use software to generate a bootstrap 
sample of differences of means. Check
the bootstrap distribution for normality
using a normal probability plot. 

c. Use the standard deviation of the boot- 
strap distribution along with the mean 
and t critical value from (a) to get a 95%
confidence interval for the difference of 
means. 

d. Use the bootstrap sample and the per- 
centile method to obtain a 95% confi- 
dence interval for the difference of 
means. 

e. Compare your three confidence inter- 
vals. If they are very similar, why do 
you think this is the case? If you had 
used a critical value from the normal 
table rather than the t table, would the
result of (c) agree better with the result 
of (d)? Why? 

f. Interpret your results. Do the blueber- 
ries make a substantial difference? 
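
Parts (b) through (d) can be sketched with the standard library alone. The code below assumes the first ten times listed are the control rats and the last ten the blueberry rats, and it hard-codes 2.101 as the t critical value (the table value for 18 df); that df is an assumption here, so substitute whatever df you obtain in part (a).

```python
import random
from statistics import fmean, stdev

control = [15.00, 7.00, 2.44, 5.60, 3.63, 6.24, 4.12, 8.21, 3.90, 0.95]
blueberry = [5.12, 9.38, 18.77, 15.03, 6.67, 7.91, 7.38, 15.09, 11.57, 8.98]

def bootstrap_mean_diffs(x, y, reps=5000, seed=1):
    """Resample each group with replacement and record mean(x*) - mean(y*)."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = rng.choices(x, k=len(x))
        ys = rng.choices(y, k=len(y))
        out.append(fmean(xs) - fmean(ys))
    return out

diffs = sorted(bootstrap_mean_diffs(blueberry, control))
observed = fmean(blueberry) - fmean(control)

# Percentile method: middle 95% of the sorted bootstrap differences.
lo = diffs[int(0.025 * len(diffs))]
hi = diffs[int(0.975 * len(diffs)) - 1]

# Bootstrap-SE method: observed difference +/- t * (SD of bootstrap distribution).
t_crit = 2.101  # ASSUMPTION: 18 df; replace with the df from part (a)
se_boot = stdev(diffs)
print("percentile interval:", round(lo, 2), round(hi, 2))
print("bootstrap-SE interval:",
      round(observed - t_crit * se_boot, 2),
      round(observed + t_crit * se_boot, 2))
```

A normal probability plot of `diffs` (not drawn here) is the check requested in part (b).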


85. For the data of the previous exercise, we
now want to test the hypothesis of equal 
population means. 


a. Carry out a two-tailed test using the two-sample
t test. Although this test requires
normal data, it will still work pretty well 
for moderately nonnormal data. Never- 
theless, you should check the data for 
normality to see if the test is justified. 

b. Carry out a two-tailed permutation test 
for the hypothesis of equal population 
means. 

c. Compare the results of (a) and (b). 
Would you expect them to be similar for 
the data of this problem? Discuss their 
relationship to the results of the previ- 
ous exercise. Summarize your conclu- 
sions about the effectiveness of 
blueberries. 
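
One way to approach part (b) is an approximate permutation test based on random shuffles of the pooled data. The sketch below is illustrative only: the number of shuffles, the seed, and the choice of the absolute difference of means as the two-tailed statistic are all assumptions.

```python
import random
from statistics import fmean

control = [15.00, 7.00, 2.44, 5.60, 3.63, 6.24, 4.12, 8.21, 3.90, 0.95]
blueberry = [5.12, 9.38, 18.77, 15.03, 6.67, 7.91, 7.38, 15.09, 11.57, 8.98]

def permutation_pvalue(x, y, reps=10000, seed=1):
    """Two-tailed approximate permutation P-value for equal means."""
    rng = random.Random(seed)
    pooled = x + y                     # new list, so x and y are not modified
    observed = abs(fmean(x) - fmean(y))
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(x)]) - fmean(pooled[len(x):]))
        if diff >= observed:
            extreme += 1
    # Count the observed arrangement itself so the estimate is never 0.
    return (extreme + 1) / (reps + 1)

p = permutation_pvalue(blueberry, control)
print(p)
```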


86. Researchers at the University of Alaska have
been trying to find inexpensive feed sources 
for Alaska reindeer growers (“Effects of Two 
Barley-Based Diets on Body Mass and 
Intake Rates of Captive Reindeer During 
Winter,” Poster Presentation: School of 
Agriculture and Land Resources
Management, University of Alaska Fairbanks,
2002). They are focusing on Alaska-grown
barley because commercially available
feed supplies are too expensive for
farmers. Typically, reindeer lose weight in 
the fall and winter, and the researchers are 
searching for a feed to minimize this loss. 
Thirteen pregnant reindeer were randomly 
divided into two groups to be fed on two 
different varieties of barley, thual and 
finaska. Here are the weight gains between 
October 1 and December 15 for the seven
that were fed thual barley (T) and the six that 
were fed finaska barley (F). 


T: -5.83 11S -5.5 -1.33 -3.83 -3.33 -7.17
F: -0.17 -0.67 -4 -3 -1.33 -0.5


The weight gains are all negative, indicat- 
ing that all of the animals lost weight. The 
thual barley is less fibrous and more 
digestible, and the intake rates for the two 
varieties of barley were very nearly the 
same, so the experimenters expected less 
weight loss for the thual variety. 


a. Determine a 95% confidence interval 
for the difference of population means 
using the two-sample t interval.

b. Use software to generate a bootstrap 
sample of differences of means. Check 
the bootstrap distribution for normality 
using a normal probability plot. 

c. Use the standard deviation of the boot- 
strap distribution along with the mean 
and t critical value from (a) to obtain a
95% confidence interval for the differ- 
ence of means. 

d. Use the bootstrap sample and the per- 
centile method to obtain a 95% confi- 
dence interval for the difference of means. 

e. Compare your three confidence inter- 
vals. If they are very similar, why do 
you think this is the case? 

f. Interpret your results. Is there a sub- 
stantial difference? Is it in the direction 
anticipated by the experimenters?


87. Consider using the data of the previous
exercise to test the hypothesis of equal 
population variances. 


a. Carry out a two-tailed test using the 
method of Section 10.5. Recall that this 
method requires the data to be normal, 
and the method is sensitive to departures 
from normality. Check the data for 
normality to see if the F test is justified. 

b. Carry out a two-tailed permutation test 
for the hypothesis of equal population 
variances (or standard deviations). 

c. Compare the two results and summarize 
your conclusions. 


88. Recall the scenario from Example 10.8
about the experiment in the low-level col- 
lege mathematics course. Here are the 85 
final exam scores for those in the experi- 
mental group (E) and the 79 final exam 
scores for those in the control group (C): 


E: 34 27 26 33 23 37 24 34 22 23 32 5 30 


29 0 30 34 26 28 27 32 29 31 33 28 21 
28 35 30 34 9 38 9 27 25 33 9 23 32 
28 38 35 16 37 25 34 38 34 31 35 28 25 
37 28 26 29 22 33 31 23 37 34 29 33 6 
8 29 36 7 21 30 28 34 25 37 28 23 26 
34 32 34 0 24 30 36 31
37 22 29 29 33 22 32 36 29 6 4
35 28 33 35 24 21 0 32 28 27 8 7
35 25 29 3 33 33 28 32 39 20 32 22 24
38 22 29 29 36 0 32 27 7 19 35
28 28 32 9 33 30 36 28 3 8 31
0 0 20 32 7 8 33 29 9 0 30


a. Determine a 95% confidence interval 
for the difference of population means 
using a two-sample t interval.

b. Use software to generate a bootstrap 
sample of differences of means. Check 
the bootstrap distribution for normality 
using a normal probability plot. 

c. Use the standard deviation of the boot- 
strap distribution along with the mean 
and t critical value from (a) to get a 95%
confidence interval for the difference of 
means.

d. Use the bootstrap sample and the percentile
method to obtain a 95% confidence
interval for the difference of
means.

e. Compare your three confidence inter- 
vals. If they are very similar, why do 
you think this is the case? In the light of 
your results for (c) and (d), does the 
two-sample t interval of (a) seem to
work, regardless of normality? Explain. 

f. Are your results consistent with the 
results of Example 10.8? Explain. 


89. Return to the data of Example 10.8.


a. Carry out a two-tailed permutation test 
for the hypothesis of equal population 
means. 

b. Compare the results for (a) and Example 
10.8. Why should you have expected 
(a) and Example 10.8 to give similar 
results? 


90. For the data of Example 10.8 it might be
more appropriate to compare medians. 


a. Find the medians for the two groups. 
With the help of a stem-and-leaf display 
for each group, explain why the medi- 
ans are much closer than the means. 

b. Carry out a two-tailed permutation test 
to compare population medians. Given 
what you found in (a), explain why the 
result of the permutation test was to be 
expected. 


91. Two students, Miguel Melo and Cody
Watson, compared textbook prices at the 
campus bookstore and Amazon.com. To be 
fair, they included the sales tax for the local 
store and added shipping for Amazon. Here 
are the prices for a sample of 27 books. 


Campus Amazon Campus Amazon 
100.41 106.94 59.50 69.24 
99.34 113.94 87.66 73.84 
51.53 61.44 26.56 33.98 
20.45 31.59 44.63 40.39 
28.69 29.89 96.69 117.99 
70.66 83.94 18.06 27.94 
98.81 107.74 103.06 115.74 
111.56 115.99 14.61 24.69
97.22 108.29 77.03 88.04
61.89 78.44 99.34 113.94
70.39 82.94 81.81 90.74
58.17 65.74 48.88 58.94
108.38 122.09 76.50 91.94
61.63 63.49


a. Determine a 95% confidence interval for 
the difference of population means using 
the t method of Section 10.3. Check the 
data for normality. Even if the normality 
assumption is not valid here, explain 
why the t method (or the z method of
Section 10.1) might still be appropriate. 

b. Based on the 27 differences, use soft- 
ware to obtain a bootstrap sample of 
mean differences. Check the bootstrap 
distribution for normality. 

c. Use the standard deviation of the boot- 
strap distribution along with the mean 
and t critical value from (a) to get a 95%
confidence interval for the difference of 
means. 

d. Use the bootstrap sample and the per- 
centile method to obtain a 95% confi- 
dence interval for the difference of 
means. 

e. Compare your three confidence inter- 
vals. In the light of your results for (d), 
does nonnormality invalidate the results 
of (a) and (c)? Explain. 

f. Interpret your results. Is there a sub- 
stantial difference between the two 
ways to buy books? Assuming that the 
populations remain unchanged and you 
have just these two sources, where 
would you buy? 


92. Consider testing the hypothesis of equal
population means based on the data in the
previous exercise.


a. Carry out a two-tailed test using the 
method of Section 10.3. Is the normality 
assumption satisfied here? If not, why 
might the test be valid anyway? 

b. Carry out a two-tailed permutation test 
for the hypothesis of equal population 
means.

c. Compare the results for (a) and (b). If
the two results are similar, does it tend 
to validate (a), regardless of normality? 


93. Compare bootstrapping with approximate
permutation tests in which random permu- 
tations are used. Discuss the similarities 
and differences. 


94. Assume that X is uniformly distributed on
$[-1, 1]$ and the Y distribution is uniform
on the two intervals $[-101, -100]$ and
$[100, 101]$. Thus the means are both 0, but
the variances differ substantially. We take
random samples of size three from each
distribution and apply a permutation test for
the null hypothesis $H_0: \mu_X = \mu_Y$ against the
alternative $H_a: \mu_X < \mu_Y$.


a. Show that the probability is 1/8 that all 
three of the Y values come from the 
interval [100, 101]. 

b. Show that, if all three Y values come 
from [100, 101], then the P-value for the 
permutation test is .05. 

c. Explain why (a) and (b) are in conflict. 
What is the probability that the permu- 
tation test rejects the null hypothesis at 
the .05 level? 
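
The count behind part (b) can be checked by listing all 20 ways of splitting six observations into two groups of three. The sketch below uses hypothetical draws (any X values in [-1, 1] and Y values in [100, 101] behave the same way) and takes the difference of sample means as the test statistic, which is an assumption about the statistic intended.

```python
from itertools import combinations

def exact_lower_pvalue(x, y):
    """Exact permutation P-value for H_a: mu_X < mu_Y, using
    mean(first group) - mean(second group) over all regroupings."""
    pooled = x + y
    m, n = len(x), len(y)
    total = sum(pooled)

    def stat(sum_first):
        return sum_first / m - (total - sum_first) / n

    observed = stat(sum(x))
    count = splits = 0
    for group in combinations(pooled, m):
        splits += 1
        if stat(sum(group)) <= observed:
            count += 1
    return count / splits

x = [-0.7, 0.2, 0.9]          # hypothetical draws from [-1, 1]
y = [100.1, 100.5, 100.8]     # hypothetical draws, all from [100, 101]
print(exact_lower_pvalue(x, y))   # 0.05, that is, 1 split out of 20
```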


Supplementary Exercises: (95-124) 


95. A group of 115 University of Iowa students
was randomly divided into a build-up
condition group (m = 56) and a scale-down
condition group (n = 59). The task
for each subject was to build his or her own 
pizza from a menu of 12 ingredients. The 
build-up group was told that a basic cheese 
pizza costs $5 and that each extra ingredi- 
ent would cost 50 cents. The scale-down 
group was told that a pizza with all 12 
ingredients (ugh!!!) would cost $11 and 
that deleting an ingredient would save 50 
cents. The article “A Tale of Two Pizzas: 
Building Up from a Basic Product Versus 
Scaling Down from a Fully Loaded Product”
(Market. Lett. 2002: 335-344)
reported that the mean number of ingredients
selected by the scale-down group was
significantly greater than the mean number
for the build-up group: 5.29 versus 2.71. 
The calculated value of the appropriate 
t statistic was 6.07. Would you reject the 
null hypothesis of equality in favor of 
inequality at a significance level of .05? 
.01? .001? Can you think of other products 
aside from pizza where one could build up 
or scale down? [Note: A separate experi- 
ment involved students from the University 
of Rome, but details were a bit different 
because there are typically not so many 
ingredient choices in Italy.] 


96. Is the number of export markets in which a
firm sells its products related to the firm’s 
return on sales? The article “Technology 
Industry Success: Strategic Options for 
Small and Medium Firms” (Bus. Horizons, 
Sept.-Oct. 2003: 41-46) gave the accompanying
information on the number of export
markets for one group of firms whose return 
on sales was less than 10% and another 
group whose return was at least 10%. 


Return Sample size Sample mean Sample SD 
Less than 10% 36 5.12 7 
At least 10% 47 8.26 1.20

The investigators reported that an appropriate
test of hypotheses resulted in a P-value
between .01 and .05. What hypotheses do 
you think were tested, and do you agree with 
the stated P-value information? What 
assumptions if any are needed in order to 
carry out the test? Can the plausibility of 
these assumptions be investigated based just 
on the foregoing summary data? Explain. 


97. Suppose when using a two-sample t procedure
that $m < n$, and show that $\nu > m - 1$.
(This is why some authors suggest using
$\min(m - 1, n - 1)$ as df in place of Welch’s
formula.) What impact does this have on
the CI and test procedure?
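
A numerical check is no substitute for the proof, but it can make the inequality concrete. The sketch below assumes that "Welch's formula" refers to the usual Welch-Satterthwaite approximation, $\nu = (s_1^2/m + s_2^2/n)^2 / [(s_1^2/m)^2/(m-1) + (s_2^2/n)^2/(n-1)]$; the sample standard deviations and sizes are hypothetical except for the first row, which borrows the summary values 27 and 41 with m = n = 10.

```python
def welch_df(s1, m, s2, n):
    """Welch-Satterthwaite approximate degrees of freedom."""
    a, b = s1**2 / m, s2**2 / n
    return (a + b) ** 2 / (a**2 / (m - 1) + b**2 / (n - 1))

cases = [
    (27, 10, 41, 10),    # equal sample sizes
    (3.0, 5, 0.1, 50),   # small sample far more variable: df close to m - 1
    (0.1, 5, 3.0, 50),   # large sample far more variable: df close to n - 1
    (2.0, 8, 2.0, 12),   # equal SDs
]
for s1, m, s2, n in cases:
    nu = welch_df(s1, m, s2, n)
    print(m, n, round(nu, 2), nu > min(m, n) - 1)
```

Replacing $\nu$ by the smaller value min(m - 1, n - 1) gives a larger t critical value, so the resulting interval is wider and the test more conservative.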


98. The accompanying summary data on compression
strength (lb) for 12 × 10 × 8 in.
boxes appeared in the article “Compression 
of Single-Wall Corrugated Shipping
Containers Using Fixed and Floating Test
Platens” (J. Testing Eval. 1992: 318-320). 
The authors stated that “the difference 
between the compression strength using 
fixed and floating platen method was found 
to be small compared to normal variation in 
compression strength between identical 
boxes.” Do you agree? 


Method Sample size Sample mean Sample SD 
Fixed 10 807 27 
Floating 10 757 41 


99. The authors of the article “Dynamics of 
Canopy Structure and Light Interception in 
Pinus elliotti, North Florida” (Ecol. 
Monogr. 1991: 33-51) planned an experi- 
ment to determine the effect of fertilizer on 
a measure of leaf area. A number of plots 
were available for the study, and half were 
selected at random to be fertilized. To 
ensure that the plots to receive the fertilizer 
and the control plots were similar, before 
beginning the experiment tree density (the 
number of trees per hectare) was recorded 
for eight plots to be fertilized and eight 
control plots, resulting in the given data. 
Minitab output follows. 


Fertilizer plots: 1024 1216 1312 1280 1216 1312 992 1120
Control plots:    1104 1072 1088 1328 1376 1280 1120 1200


Two sample T for fertilize vs. control

           N  Mean  Std. Dev.  SE Mean
Fertilize  8  1184  126        44
Control    8  1196  118        42

95% CI for mu fertilize-mu control:
(-144, 120)
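
The summary output above can be reproduced from the raw data. This is a minimal sketch using only the standard library; the t critical value 2.160 is taken from a t table for 13 df, and truncating the Welch df to an integer is an assumption about how the package arrived at its interval.

```python
from math import sqrt
from statistics import mean, stdev

fertilizer = [1024, 1216, 1312, 1280, 1216, 1312, 992, 1120]
control = [1104, 1072, 1088, 1328, 1376, 1280, 1120, 1200]

m, n = len(fertilizer), len(control)
se1 = stdev(fertilizer) / sqrt(m)      # SE Mean, fertilizer
se2 = stdev(control) / sqrt(n)         # SE Mean, control
se_diff = sqrt(se1**2 + se2**2)
diff = mean(fertilizer) - mean(control)

# Welch degrees of freedom, truncated to an integer (assumed convention).
df = int(se_diff**4 / (se1**4 / (m - 1) + se2**4 / (n - 1)))

t_crit = 2.160  # ASSUMPTION: table value for 13 df, upper-tail area .025
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(mean(fertilizer), mean(control), round(se1), round(se2), df)
print(round(ci[0]), round(ci[1]))
```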


a. Construct a comparative boxplot and 
comment on any interesting features. 

b. Would you conclude that there is a 
significant difference in the mean tree 
density for fertilizer and control plots? 
Use $\alpha = .05$.

c. Interpret the given confidence interval.


100. Is the response rate for questionnaires 
affected by including some sort of incentive 
to respond along with the questionnaire? In 
one experiment, 110 questionnaires with no 
incentive resulted in 75 being returned, 
whereas 98 questionnaires that included a 
chance to win a lottery yielded 66 respon- 
ses (“Charities, No; Lotteries, No; Cash, 
Yes,” Public Opinion Q. 1996: 542-562). 
Does this data suggest that including an 
incentive increases the likelihood of a 
response? State and test the relevant 
hypotheses at significance level .10 by 
using the P-value method. 
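
The calculation can be sketched as a large-sample z test for two proportions. The pooled standard error and the upper-tailed alternative (the incentive raises the response rate) are assumptions about the intended procedure.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 75, 110   # no incentive: returned, sent
x2, n2 = 66, 98    # lottery incentive: returned, sent

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p2 - p1) / se
p_value = 1 - NormalDist().cdf(z)   # upper tail: H_a is p2 > p1
print(round(p1, 3), round(p2, 3), round(z, 2), round(p_value, 3))
```

Here the sample response rate with the incentive is slightly lower than without it, so z is negative and the upper-tailed P-value exceeds .5.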


101. The article “Quantitative MRI and Elec- 
trophysiology of Preoperative Carpal Tun- 
nel Syndrome in a Female Population” 
(Ergonomics 1997: 642-649) reported that 
(-473.3, 1691.9) was a large-sample 95%
confidence interval for the difference 
between true average thenar muscle volume 
(mm³) for sufferers of carpal tunnel syndrome
and true average volume for nonsufferers.
Calculate and interpret a 90%
confidence interval for this difference. 
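
Because a large-sample interval has the form estimate ± z · SE, the 95% interval determines both the point estimate and the standard error. A short sketch of the conversion, assuming the reported interval is symmetric about the estimate:

```python
from statistics import NormalDist

lo95, hi95 = -473.3, 1691.9
center = (lo95 + hi95) / 2               # point estimate of the difference
z95 = NormalDist().inv_cdf(0.975)        # about 1.96
z90 = NormalDist().inv_cdf(0.95)         # about 1.645
se = (hi95 - lo95) / 2 / z95             # implied standard error

ci90 = (center - z90 * se, center + z90 * se)
print(round(center, 1), round(se, 1), [round(v, 1) for v in ci90])
```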


102. The following summary data on bending 
strength (lb-in/in) of joints is taken from the 
article “Bending Strength of Corner Joints 
Constructed with Injection Molded 
Splines” (Forest Prod. J. April 1997: 89-92).
Assume normal distributions.


Type Sample size Sample mean Sample SD
Without side coating 10 80.95 9.59
With side coating 10 63.23 5.96


a. Calculate a 95% lower confidence 
bound for true average strength of joints 
with a side coating. 

b. Calculate a 95% lower prediction bound 
for the strength of a single joint with a 
side coating.

c. Calculate a 95% confidence interval for
the difference between true average 
strengths for the two types of joints. 


103. An experiment was carried out to compare
various properties of cotton/polyester spun 
yarn finished with softener only and yarn 
finished with softener plus 5% DP-resin 
(“Properties of a Fabric Made with Tandem 
Spun Yarns,” Textile Res. J. 1996: 607-611).
One particularly important characteristic
of fabric is its durability, that is, its
ability to resist wear. For a sample of 40 
softener-only specimens, the sample mean 
stoll-flex abrasion resistance (cycles) in the 
filling direction of the yarn was 3975.0, with 
a sample standard deviation of 245.1. 
Another sample of 40 softener-plus speci- 
mens gave a sample mean and sample 
standard deviation of 2795.0 and 293.7, 
respectively. Calculate a confidence interval 
with confidence level 99% for the difference 
between true average abrasion resistances 
for the two types of fabrics. Does your 
interval provide convincing evidence that 
true average resistances differ for the two 
types of fabrics? Why or why not? 


104. The derailment of a freight train due to the
catastrophic failure of a traction motor 
armature bearing provided the impetus for a 
study reported in the article “Locomotive 
Traction Motor Armature Bearing Life 
Study” (Lubricat. Engr. August 1997: 12-19).
A sample of 17 high-mileage traction
motors was selected, and the amount of 
cone penetration (mm/10) was determined 
both for the pinion bearing and for the 
commutator armature bearing, resulting in 
the following data: 


Motor 1 2 3 4 5 6 
Commutator 211 273 305 258 270 209 
Pinion 226 278 259 244 273 236 
Motor 7 8 9 10 11 12 
Commutator 223 288 296 233 262 291 
Pinion 290 287 287 242 288 242 
Motor 13 14 15 16 17
Commutator 278 2715 210 272 264
Pinion


Calculate an estimate of the population
mean difference between penetration for 
the commutator armature bearing and 
penetration for the pinion bearing, and do 
so in a way that conveys information about 
the reliability and precision of the estimate. 
[Note: A normal probability plot validates 
the necessary normality assumption.] 
Would you say that the population mean 
difference has been precisely estimated? 
Does it look as though population mean 
penetration differs for the two types of 
bearings? Explain. 


105. The article “Two Parameters Limiting the
Sensitivity of Laboratory Tests of Condoms
as Viral Barriers” (J. Test. Eval. 1996: 279-286)
reported that, in brand A condoms,
among 16 tears produced by a puncturing
needle, the sample mean tear length was
74.0 μm, whereas for the 14 brand B tears,
the sample mean length was 61.0 μm (determined
using light microscopy and scanning
electron micrographs). Suppose the
sample standard deviations are 14.8 and
12.5, respectively (consistent with the
sample ranges given in the article). The
authors commented that the thicker brand B
condom displayed a smaller mean tear
length than the thinner brand A condom. Is
this difference in fact statistically significant?
State the appropriate hypotheses and
test at $\alpha = .05$.


106. Information about hand posture and forces
generated by the fingers during manipula- 
tion of various daily objects is needed for 
designing high-tech hand prosthetic devi- 
ces. The article “Grip Posture and Forces 
During Holding Cylindrical Objects with 
Circular Grips” (Ergonomics 1996: 1163- 
1176) reported that for a sample of 11 
females, the sample mean four-finger pinch 
strength (N) was 98.1 and the sample 
standard deviation was 14.2. For a sample 
of 15 males, the sample mean and sample 
standard deviation were 129.2 and 39.1, 
respectively.

a. A test carried out to see whether true
average strengths for the two genders
were different resulted in t = 2.51 and
P-value = .019. Does the appropriate test
procedure described in this chapter yield
this value of t and the stated P-value?

b. Is there substantial evidence for con- 
cluding that true average strength for 
males exceeds that for females by more 
than 25 N? State and test the relevant 
hypotheses. 
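
For part (a), it may help to compute the test statistic two ways from the summary values: with separate variance estimates and with a pooled variance. The sketch below only computes the two statistics; which one counts as "the appropriate test procedure" is left to the reader.

```python
from math import sqrt

m, xbar, s1 = 15, 129.2, 39.1   # males: size, mean, SD
n, ybar, s2 = 11, 98.1, 14.2    # females: size, mean, SD

# Unpooled (Welch) statistic.
t_welch = (xbar - ybar) / sqrt(s1**2 / m + s2**2 / n)

# Pooled statistic, which assumes equal population variances.
sp2 = ((m - 1) * s1**2 + (n - 1) * s2**2) / (m + n - 2)
t_pooled = (xbar - ybar) / sqrt(sp2 * (1 / m + 1 / n))

print(round(t_welch, 2), round(t_pooled, 2))
```

For part (b), the same denominators apply with 25 subtracted from the numerator.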


107. After the Enron scandal in the fall of 2001,
faculty in accounting began to incorporate 
ethics more into accounting courses. One 
study looked at the effectiveness of such 
educational interventions “pre-Enron” and 
“post-Enron.” The data below shows students’
improvement in score on the
Accounting Ethical Dilemma Instrument
(AEDI) across a one-semester accounting
class in Spring 2001 (“pre-Enron”) and
another in Spring 2002 (“post-Enron”).
(From “A Note in Ethics Educational 
Interventions in an Undergraduate Auditing 
Course: Is There an ‘Enron Effect’?” Issues 
Account. Educ. 2004: 53-71.) 


Improvement in AEDI score
                    n   Mean   SD
2001 (pre-Enron)    37  5.48   13.83
2002 (post-Enron)   21  6.31   13.20


a. Test to see whether the 2001 class 
showed a statistically significant improve- 
ment in AEDI score across the semester. 

b. Test to see whether the 2002 class showed 
a statistically significant improvement in
AEDI score across the semester. 

c. Test to see whether the 2002 class showed 
a statistically significantly greater
improvement in AEDI score than the 2001 
class. In this respect, does there appear to 
be an “Enron effect”?


108. Torsion during hip external rotation
(ER) and extension may be responsible for 
certain kinds of injuries in golfers and other 
athletes. The article “Hip Rotational
Velocities during the Full Golf Swing”
(J. Sport Sci. Med. 2009: 296-299) reported
on a study in which peak ER velocity
and peak IR (internal rotation) velocity 
(both in deg/s) were determined for a 
sample of 15 female collegiate golfers 
during their swings. The following data was 
supplied by the article’s authors. 


Golfer ER IR Diff.
1 -130.6 -98.9 -31.7
2 -125.1 -115.9 -9.2
3 -51.7 -161.6 109.9
4 -179.7 -196.9 17.2
5 -130.5 -170.7 40.2
6 -101.0 -274.9 173.9
7 -24.4 -275.0 250.6
8 -231.1 -275.7 44.6
9 -186.8 -214.6 27.8
10 -58.5 -117.8 59.3
11 -219.3 -326.7 107.4
12 -113.1 -272.9 159.8
13 -244.3 -429.1 184.8
14 -184.4 -140.6 -43.8
15 -199.2 -345.6 146.4


a. Is it plausible that the differences came 
from a normally distributed population? 

b. The article reported that mean(sd) =
-145.3(68.0) for ER velocity and
-227.8(96.6) for IR velocity. Based just
on this information, could a test of 
hypotheses about the difference 
between true average IR velocity and 
true average ER velocity be carried out? 
Explain. 

c. Do an appropriate hypothesis test about 
the difference between true average IR 
velocity and true average ER velocity 
and interpret the result. 
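
Since each golfer contributes both an ER and an IR value, part (c) calls for a paired analysis. A minimal sketch of the paired t statistic follows; the differences are computed as ER minus IR to match the table, so the sign flips if you work with IR minus ER instead.

```python
from math import sqrt
from statistics import mean, stdev

er = [-130.6, -125.1, -51.7, -179.7, -130.5, -101.0, -24.4, -231.1,
      -186.8, -58.5, -219.3, -113.1, -244.3, -184.4, -199.2]
ir = [-98.9, -115.9, -161.6, -196.9, -170.7, -274.9, -275.0, -275.7,
      -214.6, -117.8, -326.7, -272.9, -429.1, -140.6, -345.6]

diffs = [e - i for e, i in zip(er, ir)]   # ER - IR, as in the Diff. column
d_bar = mean(diffs)
s_d = stdev(diffs)
t_stat = d_bar / (s_d / sqrt(len(diffs)))  # refer to a t distribution, n - 1 = 14 df
print(round(d_bar, 2), round(s_d, 2), round(t_stat, 2))
```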


109. The accompanying summary data on the ratio
of strength to cross-sectional area for knee
extensors is taken from the article “Knee
Extensor and Knee Flexor Strength: Cross-Sectional
Area Ratios in Young and Elderly
Men” (J. Gerontol. 1992: M204-M210).


Group Sample size Sample mean Standard error
Young 13 7.47 .22
Elderly men 12 6.71 .28

Does this data suggest that the true average
ratio for young men exceeds that for 
elderly men? Carry out a test of appropriate 
hypotheses using $\alpha = .05$. Be sure to state
any assumptions necessary for your 
analysis. 


110. The accompanying data on response time 
appeared in the article “The Extinguish- 
ment of Fires Using Low-Flow Water Hose 
Streams—Part II” (Fire Tech. 1991: 291- 
320). The samples are independent, not 
paired. 

Good visibility: .43 1.17 .37 .47 .68 .58 .50 2.75
Poor visibility: 1.47 .80 1.58 1.53 4.33 4.23 3.25 3.22


The authors analyzed the data with the 
pooled t test. Does the use of this test appear
justified? [Hint: Check for normality.] 


111. The accompanying data on the alcohol 
content of wine is representative of that 
reported in a study in which wines from the 
years 1999 and 2000 were randomly 
selected and the actual content was determined
by laboratory analysis (London
Times August 5, 2001).


Wine 1 2 3 4 5 6
Actual 14.2 14.5 14.0 14.9 13.6 12.6
Label 14.0 14.0 13.5 15.0 13.0 12.5


The two-sample t test gives a test statistic
value of .62 and a two-tailed P-value of 
.55. Does this convince you that there is no 
significant difference between true average 
actual alcohol content and true average 
content stated on the label? Explain. 
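
Both measurements are made on the same six wines, so it is instructive to compare the two-sample statistic quoted above with the paired statistic. The sketch below assumes the first data row is the actual content and the second the label value.

```python
from math import sqrt
from statistics import mean, stdev, variance

actual = [14.2, 14.5, 14.0, 14.9, 13.6, 12.6]
label = [14.0, 14.0, 13.5, 15.0, 13.0, 12.5]
n = len(actual)

# Two-sample statistic, treating the rows as independent samples.
t_two_sample = (mean(actual) - mean(label)) / sqrt(
    variance(actual) / n + variance(label) / n)

# Paired statistic, based on the six within-wine differences.
diffs = [a - b for a, b in zip(actual, label)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(n))

print(round(t_two_sample, 2), round(t_paired, 2))
```

The wine-to-wine variation inflates the two-sample standard error, whereas pairing removes it.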


112. The article “The Accuracy of Stated Energy 
Contents of Reduced-Energy, Commer- 
cially Prepared Foods” (J. Am. Diet. Assoc. 
2010: 116-123) presented the accompany- 
ing data on vendor-stated gross energy and 
measured value (both in kcal) for 10 different
supermarket convenience meals:


Meal 1 2 3 4 5
Stated 180 220 190 230 200
Measured 212 319 231 306 211
Meal 6 7 8 9 10
Stated 370 250 240 80 180
Measured 431 288 265 145 228


Obtain a 95% confidence interval for the 
difference of population means. By roughly 
what percentage are the actual calories 
higher than the stated value? 

Note that the article calls this a conve- 
nience sample and suggests that therefore it 
should have limited value for inference. 
However, even if the ten meals were a 
random sample from their local store, there 
could still be a problem in drawing con- 
clusions about a purchase at your store. 


113. How does energy intake compare to energy 
expenditure? One aspect of this issue was 
considered in the article “Measurement of 
Total Energy Expenditure by the Doubly 
Labelled Water Method in Professional 
Soccer Players” (J. Sports Sci. 2002: 391- 
397), which contained the accompanying
data (MJ/day).


Player 1 2 3 4 5 6 7
Expenditure 14.4 12.1 14.3 14.2 15.2 15.5 17.8
Intake 14.6 9.2 11.8 11.6 12.7 15.0 16.3


Test to see whether there is a significant 
difference between intake and expenditure. 
Does the conclusion depend on whether a 
significance level of .05, .01, or .001 is 
used? 


114. An experimenter wishes to obtain a CI for 
the difference between true average break- 
ing strength for cables manufactured by 
company I and by company II. Suppose 
breaking strength is normally distributed 
for both types of cable with $\sigma_1 = 30$ psi and
$\sigma_2 = 20$ psi.

a. If costs dictate that the sample size for
the type I cable should be three times
the sample size for the type II cable,
how many observations are required if
the 99% CI is to be no wider than 20 
psi? 

b. Suppose a total of 400 observations is to 
be made. How many of the observations 
should be made on type I cable samples 
if the width of the resulting interval is to 
be a minimum? 
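
Both parts rest on the fact that, with known standard deviations, the interval width is $2 z_{\alpha/2} \sqrt{\sigma_1^2/m + \sigma_2^2/n}$. The following sketch solves part (a) algebraically and part (b) by a simple search over the possible allocations; the brute-force search is an illustrative choice, not the only way to find the minimum.

```python
from math import ceil
from statistics import NormalDist

sigma1, sigma2 = 30.0, 20.0
z = NormalDist().inv_cdf(0.995)      # critical value for a 99% interval

# (a) With m = 3n, require 2 z sqrt(sigma1^2/(3n) + sigma2^2/n) <= 20.
n_a = ceil((sigma1**2 / 3 + sigma2**2) * (2 * z / 20) ** 2)
m_a = 3 * n_a

# (b) With m + n = 400, the width is smallest when the variance
# sigma1^2/m + sigma2^2/(400 - m) is smallest.
def var_diff(m, total=400):
    return sigma1**2 / m + sigma2**2 / (total - m)

m_b = min(range(1, 400), key=var_diff)
print(m_a, n_a, m_b, 400 - m_b)
```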


115. To assess the tendency of people to
rationalize poor performance, 246 college 
students were randomly assigned to one of 
two groups: a negative feedback group and 
a positive feedback group. All students took 
a test which asked them to identify people’s 
emotions based on photographs of their 
faces. Those in the negative feedback group 
were all given D grades, while those in the 
positive feedback group received A’s (regardless
of how they actually performed).
A follow-up questionnaire asked students to 
assess the validity of the test and the 
importance of being able to read people’s 
faces. The results of these two follow-up 
surveys appear below. 


                   Test validity rating   Face reading importance rating
Group              n    x̄     s          x̄     s
Positive feedback  123  6.95  1.09       6.62  1.19
Negative feedback  123  5.51  0.79       5.36  1.00


a. Test the hypothesis that negative feed- 
back is associated with a lower average 
validity rating than positive feedback at 
the $\alpha = .01$ level.

b. Test the hypothesis that students receiv- 
ing positive feedback rate face-reading as 
more important, on average, than do 
students receiving negative feedback. 
Again use a 1% significance level. 

c. Is it reasonable to conclude that the 
results seen in parts (a) and (b) are 
attributable to the different types of 
feedback? Why or why not? 


116. The insulin-binding capacity (pmol/mg
protein) was measured for four different 
groups of rats: (1) nondiabetic, (2) untreated 
diabetic, (3) diabetic treated with a low
dose of insulin, (4) diabetic treated with a 
high dose of insulin. The accompanying 
table gives sample sizes and sample standard
deviations. Denote the sample size for
the $i$th treatment by $n_i$ and the sample
variance by $S_i^2$ ($i = 1, 2, 3, 4$). Assuming
that the true variance for each treatment is
$\sigma^2$, construct a pooled estimator of $\sigma^2$ that is
unbiased, and verify using rules of expected
value that it is indeed unbiased. What is
your estimate for the following actual data?
[Hint: Modify the pooled estimator $S_p^2$ from
Section 10.2.]


Treatment 
1 2 3 4 
Sample size 16 18 8 12 
Sample SD 64 81 51 35 
117. Suppose a level .05 test of $H_0\colon \mu_1 - \mu_2 = 0$
versus $H_a\colon \mu_1 - \mu_2 > 0$ is to be performed,
assuming $\sigma_1 = \sigma_2 = 10$ and normality of
both distributions, using equal sample sizes
($m = n$). Evaluate the probability of a type
II error when $\mu_1 - \mu_2 = 1$ and $n$ = 25, 100,
2500, and 10,000. Can you think of real
situations in which the difference
$\mu_1 - \mu_2 = 1$ has little practical significance?
Would sample sizes of $n$ = 10,000 be
desirable in such situations?


118. Are male college students more easily bored
than their female counterparts? This ques- 
tion was examined in the article “Boredom 
in Young Adults—Gender and Cultural 
Comparisons” (J. Cross-Cult. Psych. 1991: 
209-223). The authors administered a scale 
called the Boredom Proneness Scale to 97 
male and 148 female U.S. college students. 
Does the accompanying data support the 
research hypothesis that the mean Boredom 
Proneness Rating is higher for men than for 
women? Test the appropriate hypotheses 
using a .05 significance level. 

Sex       Sample size   Sample mean   Sample SD
Male           97          10.40         4.83
Female        148           9.26         4.68
119. Researchers sent 5000 resumes in response
to job ads that appeared in the Boston
Globe and Chicago Tribune. The resumes
were identical except that 2500 of them had
“white sounding” first names, such as Brett
and Emily, whereas the other 2500 had
“black sounding” names such as Tamika
and Rasheed. The resumes of the first type
elicited 250 responses and the resumes of
the second type only 167 responses (these
numbers are consistent with information
that appeared in a January 15, 2003, report
by the Associated Press). Does this data
strongly suggest that a resume with a
“black” name is less likely to result in a
response than is a resume with a “white”
name?


120. Is touching by a coworker sexual harassment?
This question was included on a
survey given to federal employees, who
responded on a scale of 1-5, with 1 meaning
a strong negative and 5 indicating a strong
yes. The table summarizes the results.

Sex       Sample size   Sample mean   Sample SD
Female       4343         4.6056        .8659
Male         3903         4.1709       1.2157


Of course, with 1-5 being the only possible 
values, the normal distribution does not 
apply here, but the sample sizes are suffi- 
cient that it does not matter. Obtain a two- 
sided confidence interval for the difference 
in population means. Does your interval 
suggest that females are more likely than 
males to regard touching as harassment? 
Explain your reasoning. 


121. Let $X_1, \ldots, X_m$ be a random sample from a
Poisson distribution with parameter $\mu_1$, and
let $Y_1, \ldots, Y_n$ be a random sample from
another Poisson distribution with parameter
$\mu_2$. We wish to test $H_0\colon \mu_1 - \mu_2 = 0$ against
one of the three standard alternatives. When
$m$ and $n$ are large, the CLT justifies using a
large-sample z test. However, the fact that
$V(\bar{X}) = \mu/n$ suggests that a different
denominator should be used in
standardizing $\bar{X} - \bar{Y}$. Develop a large-sample
test procedure appropriate to this
problem, and then apply it to the following
data to test whether the plant densities for a
particular species are equal in two different
regions (where each observation is the
number of plants found in a randomly
located square sampling quadrat having
area 1 m², so for region 1, there were 40
quadrats in which one plant was observed,
etc.):

Frequency    0    1    2    3    4    5    6    7
Region 1    28   40   28   17    8    2    1    1    m = 125
Region 2    14   25   30   18   49    2    1    1    n = 140


122. Referring to the previous exercise, develop
a large-sample confidence interval formula
for $\mu_1 - \mu_2$. Calculate the interval for the
data given there using a confidence level of
95%.


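123. Refer back to the pooled t procedures
described at the end of Section 10.2. The
test statistic for testing $H_0\colon \mu_1 - \mu_2 = \Delta_0$ is

$$T_p = \frac{(\bar{X} - \bar{Y}) - \Delta_0}{\sqrt{\dfrac{S_p^2}{m} + \dfrac{S_p^2}{n}}} = \frac{(\bar{X} - \bar{Y}) - \Delta_0}{S_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

Show that when $\mu_1 - \mu_2 = \Delta'$ (some
alternative value for the difference), then
$T_p$ has a noncentral t distribution with
df = $m + n - 2$ and noncentrality parameter

$$\delta = \frac{\Delta' - \Delta_0}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

[Hint: Look back at Exercises 39-40, as
well as Chapter 9 Exercise 38.]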


124. Let $R_1$ be a rejection region with significance
level $\alpha$ for testing $H_{01}\colon \theta \in \Omega_1$ versus
$H_{a1}\colon \theta \notin \Omega_1$, and let $R_2$ be a level $\alpha$ rejection
region for testing $H_{02}\colon \theta \in \Omega_2$ versus
$H_{a2}\colon \theta \notin \Omega_2$, where $\Omega_1$ and $\Omega_2$ are two
disjoint sets of possible values of $\theta$. Now
consider testing $H_0\colon \theta \in \Omega_1 \cup \Omega_2$ versus
the alternative $H_a\colon \theta \notin \Omega_1 \cup \Omega_2$. The proposed
rejection region is $R_1 \cap R_2$. That is,
$H_0$ is rejected only if both $H_{01}$ and $H_{02}$ can
be rejected. This procedure is called a
union-intersection test (UIT).


a. Show that the UIT is a level $\alpha$ test.

b. As an example, let $\mu_T$ denote the mean
value of a particular variable for a generic
(test) drug, and $\mu_R$ denote the mean
value of this variable for a brand-name
(reference) drug. In bioequivalence
testing, the relevant hypotheses are
$H_0\colon \mu_T/\mu_R \le \delta_L$ or $\mu_T/\mu_R \ge \delta_U$ (the
two aren’t bioequivalent) versus
$H_a\colon \delta_L < \mu_T/\mu_R < \delta_U$ (bioequivalent).
The limits $\delta_L$ and $\delta_U$ are standards set by
regulatory agencies; the FDA often uses
.80 and 1.25 = 1/.8, respectively. By
taking logarithms and letting $\eta = \ln(\mu)$,
$\tau = \ln(\delta)$, the hypotheses become
$H_0$: either $\eta_T - \eta_R \le \tau_L$ or $\eta_T - \eta_R \ge \tau_U$ versus
$H_a\colon \tau_L < \eta_T - \eta_R < \tau_U$. With this setup,
a type I error involves saying the drugs
are bioequivalent when they are not.
The FDA mandates $\alpha = .05$.

Let $D$ be an estimator of $\eta_T - \eta_R$ with
standard error $S_D$ such that the standardized
variable $T = [D - (\eta_T - \eta_R)]/S_D$ has a
t distribution with $\nu$ df. The standard test
procedure is referred to as TOST for
“two one-sided tests” and is based on
the two test statistics $T_U = (D - \tau_U)/S_D$
and $T_L = (D - \tau_L)/S_D$. If $\nu = 20$, state
the appropriate conclusion in each of
the following cases: (1) $t_L = 2.0$, $t_U = -1.5$;
(2) $t_L = 1.5$, $t_U = -2.0$; (3) $t_L = 2.0$,
$t_U = -2.0$.




Introduction 

In studying methods for the analysis of quantitative data, we first focused on problems involving a 
single sample of numbers and then turned to a comparative analysis of two different samples. Now we 
are ready for the analysis of several samples. 

The analysis of variance, or more briefly ANOVA, refers broadly to a collection of statistical 
procedures for the analysis of quantitative responses. The simplest ANOVA problem is referred to 
variously as a single-factor, single-classification, or one-way ANOVA and involves the analysis of 
data sampled from two or more numerical populations (distributions). The characteristic that labels 
the populations is called the factor under study, and the populations are referred to as the levels of the 
factor. Examples of such situations include the following: 


1. An experiment to study the effects of five different brands of gasoline on automobile engine 
operating efficiency (mpg) 

2. An experiment to study the effects of four different sugar solutions (glucose, sucrose, fructose, and 
a mixture of the three) on bacterial growth 

3. An experiment to investigate whether hardwood concentration in pulp has an effect on tensile 
strength of bags made from the pulp 

4. An experiment to decide whether the color density of fabric specimens depends on the amount of 
dye used 


In (1) the factor of interest is gasoline brand, and there are five different levels of the factor. The 
factor in (2) is sugar, with four levels (or five, if a control solution containing no sugar is used). The 
factor in both of these first two examples is categorical in nature, and the levels correspond to possible 
categories of the factor. In (3) and (4), the factors are concentration of hardwood and amount of dye, 
respectively; both these factors are quantitative in nature, so the levels identify different settings of the 
factor. When the factor of interest is quantitative, statistical techniques from regression analysis 
(discussed in Chapter 12) can also be used to analyze the data. 

Here we first introduce single-factor ANOVA. Section 11.1 presents the F test for testing the null 
hypothesis that the population means are identical. Section 11.2 considers further analysis of the data 
when $H_0$ has been rejected. Section 11.3 covers some other aspects of single-factor ANOVA. Many
experimental situations involve studying the simultaneous impact of more than one factor. Various 
aspects of two-factor ANOVA are considered in the last two sections of the chapter. 
11.1 Single-Factor ANOVA


Single-factor ANOVA focuses on a comparison of two or more populations. For example, 
McDonalds may wish to compare the average revenue associated with three different advertising 
campaigns, or a team of animal nutritionists may carry out an experiment to compare the effect of five 
different diets on weight gain, or FedEx may want to compare the strengths of cardboard shipping 
boxes from four different vendors. Let 


$I$ = the number of populations/treatments being compared
$\mu_1$ = the mean of population 1 (or the true average response when treatment 1 is applied)
$\vdots$
$\mu_I$ = the mean of population $I$ (or the true average response when treatment $I$ is applied)


Then the hypotheses of interest are 


$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_I$$
versus

$$H_a\colon \text{at least two of the } \mu_i\text{'s are different}$$


If $I = 4$, $H_0$ is true only if all four $\mu_i$'s are identical. $H_a$ would be true, for example, if $\mu_1 = \mu_2 \ne \mu_3 = \mu_4$,
if $\mu_1 = \mu_3 = \mu_4 \ne \mu_2$, or if all four $\mu_i$'s differ from each other. A test of these hypotheses
requires that we have available a random sample from each population or treatment. Since ANOVA 
focuses on a comparison of means, you may wonder why the method is called analysis of variance 
(actually, analysis of variability would be a better name). The following example illustrates why it is 
appropriate to consider variability. 


Example 11.1 The article “Compression of Single-Wall Corrugated Shipping Containers Using 
Fixed and Floating Test Platens” (J. Test. Eval. 1992: 318-320) describes an experiment in which 
several different types of boxes were compared with respect to compression strength (lb). Table 11.1
presents the results of an experiment involving $I = 4$ types of boxes (the sample means and standard
deviations are in good agreement with values given in the article). 


Table 11.1 The data and summary quantities for Example 11.1

Type of box   Compression strength (lb)                      Sample mean   Sample SD
1             655.5   788.3   734.3   721.4   679.1   699.4    713.00        46.55
2             789.2   772.5   786.9   686.1   732.1   774.8    756.93        40.34
3             737.1   639.0   696.3   671.7   717.2   727.1    698.07        37.20
4             535.1   628.7   542.4   559.0   586.9   520.0    562.02        39.87

Grand mean = 682.50


With $\mu_i$ denoting the true average compression strength for boxes of type $i$ ($i$ = 1, 2, 3, 4), the null
hypothesis is $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$. Figure 11.1a shows a comparative boxplot for the four samples.
There is a substantial amount of overlap among observations on the first three box types, but
compression strengths for the fourth type appear considerably smaller than for the others. This
suggests that $H_0$ is not true.
Figure 11.1 Boxplots for Example 11.1: (a) original data; (b) data with one mean altered;
(c) data with standard deviations altered
The comparative boxplot in Figure 11.1b is based on adding 120 to each observation in the fourth
sample (giving a mean of 682.02 and the same standard deviation) and leaving the other samples
unaltered. Because the sample means are now closer together, it is no longer obvious whether $H_0$
should be rejected.

Lastly, the comparative boxplot in Figure 11.1c is based on inflating the standard deviation of each
sample while maintaining the values of the original sample means. Once again, it is unclear from the
graph whether $H_0$ is true, even though now the sample means are separated by the same amounts as
they were in Figure 11.1a.

These graphs suggest that it’s insufficient to consider only how far apart the sample means are in
assessing whether the population means are different; we must also account for the amount of
variability within each of the $I$ samples.


Notation and Assumptions 

In two-sample problems, we used the letters X and Y to designate the observations in the two samples. 
Because this is cumbersome for three or more samples, it is customary to use a single letter with two 
subscripts. The first subscript identifies the sample number, corresponding to the population or 
treatment being sampled, and the second subscript denotes the position of the observation within that 
sample. Let 


$X_{ij}$ = the random variable denoting the $j$th measurement from the $i$th population or treatment
$x_{ij}$ = the observed value of $X_{ij}$ when the experiment is performed

The observed data is often displayed in a rectangular table, such as Table 11.1. There, samples from
the different populations appear in different rows of the table, and $x_{ij}$ is the $j$th number in the $i$th
sample. For example, $x_{23} = 786.9$ (the third observation from the second population), and
$x_{41} = 535.1$. When there is potential ambiguity, we will write $x_{i,j}$ rather than $x_{ij}$ (e.g., if there were 15
observations on each of 12 treatments, $x_{112}$ could mean $x_{1,12}$ or $x_{11,2}$). It is assumed that the $X_{ij}$'s
within any particular sample are independent (a random sample from the $i$th population or treatment
distribution) and that different samples are independent of each other.

In some studies, different samples contain different numbers of observations. However, the con- 
cepts and methods of single-factor ANOVA are most easily developed for the case of equal sample 
sizes, known as a balanced study design. Unequal sample sizes will be considered in Section 11.3. 
Restricting ourselves for the moment to balanced designs, let J denote the number of observations in 
each sample (J = 6 in Example 11.1). The data set consists of n = IJ observations. The sample means 
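will be denoted by $\bar{X}_{1\cdot}, \bar{X}_{2\cdot}, \ldots, \bar{X}_{I\cdot}$. That is,

$$\bar{X}_{i\cdot} = \frac{1}{J}\sum_{j=1}^{J} X_{ij} \qquad i = 1, 2, \ldots, I$$

The dot in place of the second subscript indicates averaging over that subscript. Similarly, the average
of all $IJ$ observations, called the grand mean, is

$$\bar{X}_{\cdot\cdot} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}$$

For the strength data in Table 11.1, $\bar{x}_{1\cdot} = 713.00$, $\bar{x}_{2\cdot} = 756.93$, $\bar{x}_{3\cdot} = 698.07$, $\bar{x}_{4\cdot} = 562.02$, and
$\bar{x}_{\cdot\cdot} = 682.50$. Additionally, let $S_1^2, S_2^2, \ldots, S_I^2$ represent the sample variances:

$$S_i^2 = \frac{1}{J - 1}\sum_{j=1}^{J}\left(X_{ij} - \bar{X}_{i\cdot}\right)^2 \qquad i = 1, 2, \ldots, I$$

From Example 11.1, $s_1 = 46.55$, $s_1^2 = 2166.90$, and so on.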
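The summary quantities in Table 11.1 can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the variable names (`strength`, `sample_means`, and so on) are illustrative choices, not notation from the text.

```python
from statistics import mean, stdev

# Compression strengths (lb) from Table 11.1, one list per box type.
strength = {
    1: [655.5, 788.3, 734.3, 721.4, 679.1, 699.4],
    2: [789.2, 772.5, 786.9, 686.1, 732.1, 774.8],
    3: [737.1, 639.0, 696.3, 671.7, 717.2, 727.1],
    4: [535.1, 628.7, 542.4, 559.0, 586.9, 520.0],
}

# Sample means (x-bar_i.), sample SDs (divisor J - 1), and the grand mean.
sample_means = {i: mean(x) for i, x in strength.items()}
sample_sds = {i: stdev(x) for i, x in strength.items()}
grand_mean = mean(v for x in strength.values() for v in x)

for i in strength:
    print(f"type {i}: mean = {sample_means[i]:.2f}, sd = {sample_sds[i]:.2f}")
print(f"grand mean = {grand_mean:.2f}")
```

Because the design is balanced, the grand mean is also the ordinary average of the four sample means.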


ANOVA ASSUMPTIONS The $I$ population or treatment distributions are all normal with the same
variance $\sigma^2$. That is, the $X_{ij}$'s are independent and normally distributed
with

$$E(X_{ij}) = \mu_i \qquad V(X_{ij}) = \sigma^2$$


The plausibility of independence of the samples (and of the individual observations within the 
samples) stems from a study’s design. At the end of this section we discuss methods for checking the 
plausibility of the normality and equal variance assumptions. 


Sums of Squares and Mean Squares 

Example 11.1 suggests the need for two distinct measures of variation: between-samples variation 
(i.e., the disparity between the $I$ sample means) and within-samples variation (assessing variation
separately within each sample and then combining). The test procedure we will develop shortly is 
based on the following measures of variation in the data. 


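DEFINITION A measure of between-samples variation is the treatment sum of squares
SSTr, given by

$$\mathrm{SSTr} = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}\right)^2 = J\sum_{i=1}^{I}\left(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}\right)^2 = J\left[\left(\bar{X}_{1\cdot} - \bar{X}_{\cdot\cdot}\right)^2 + \cdots + \left(\bar{X}_{I\cdot} - \bar{X}_{\cdot\cdot}\right)^2\right]$$

A measure of within-samples variation is the error sum of squares SSE,
given by

$$\mathrm{SSE} = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(X_{ij} - \bar{X}_{i\cdot}\right)^2 = \sum_{i=1}^{I}(J - 1)S_i^2 = (J - 1)\left[S_1^2 + S_2^2 + \cdots + S_I^2\right]$$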

Thus SSTr assesses variation in the means from the different samples, whereas SSE entails assessing 
variability within each sample separately (via the sample variance) and then combining these 
assessments. 
Example 11.2 (Example 11.1 continued) For the box compression data displayed in Figure 11.1a,

$$\mathrm{SSTr} = 6(713.00 - 682.50)^2 + 6(756.93 - 682.50)^2 + 6(698.07 - 682.50)^2 + 6(562.02 - 682.50)^2 = 127{,}374.7$$

while

$$\mathrm{SSE} = (6 - 1)(46.55)^2 + (6 - 1)(40.34)^2 + (6 - 1)(37.20)^2 + (6 - 1)(39.87)^2 = 33{,}838.4$$

Similar computations can be applied to the two modified versions of the original data; the various
sums of squares are summarized below.

        Figure 11.1a   Figure 11.1b   Figure 11.1c
SSTr    127,374.7      14,004.6       127,374.7
SSE     33,838.4       33,838.4       211,488.0

For the altered data on which Figure 11.1b is based, $\bar{x}_{4\cdot} = 682.02$ and the revised grand mean is
$\bar{x}_{\cdot\cdot} = 712.50$. This greatly reduces SSTr, reflecting the fact that the sample means are less far apart in
Figure 11.1b than in Figure 11.1a. Since the standard deviations of the samples were not changed,
the values of SSE for the data displayed in Figures 11.1a and 11.1b are identical.

For the data used to construct Figure 11.1c, the value of SSTr is unchanged from Figure 11.1a: the
$\bar{x}_{i\cdot}$'s were not altered, so this measure of between-samples variation stays the same. On the other hand,
SSE for Figure 11.1c is much larger than for the actual data. Since the altered data exhibits much
greater within-samples variation than does the actual data, the corresponding SSE is
correspondingly greater.
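The behavior of the two sums of squares can be checked numerically. The sketch below computes SSTr and SSE from the raw observations in Table 11.1 and then from a stretched copy of the data in the spirit of Figure 11.1c. The stretch factor 2.5 is an assumption, chosen only because 211,488.0/33,838.4 = 6.25 = 2.5²; the text does not say how its altered data set was constructed, and the function name `sums_of_squares` is hypothetical.

```python
from statistics import mean

def sums_of_squares(samples):
    """Return (SSTr, SSE) for a balanced single-factor layout."""
    J = len(samples[0])
    means = [mean(s) for s in samples]
    grand = mean(means)  # equals the grand mean because all samples have J values
    sstr = J * sum((m - grand) ** 2 for m in means)
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    return sstr, sse

boxes = [
    [655.5, 788.3, 734.3, 721.4, 679.1, 699.4],
    [789.2, 772.5, 786.9, 686.1, 732.1, 774.8],
    [737.1, 639.0, 696.3, 671.7, 717.2, 727.1],
    [535.1, 628.7, 542.4, 559.0, 586.9, 520.0],
]
sstr, sse = sums_of_squares(boxes)

# Assumed construction: stretch each deviation from its own sample mean by 2.5,
# which keeps every sample mean fixed and multiplies every sample SD by 2.5.
stretched = [[mean(s) + 2.5 * (x - mean(s)) for x in s] for s in boxes]
sstr_c, sse_c = sums_of_squares(stretched)

print(f"original : SSTr = {sstr:,.1f}, SSE = {sse:,.1f}")
print(f"stretched: SSTr = {sstr_c:,.1f}, SSE = {sse_c:,.1f}")
```

Working from the raw observations rather than the rounded summary statistics changes the results only in the last digit or so; the stretched data leaves SSTr untouched and multiplies SSE by 6.25.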


A descriptive understanding of the treatment and error sums of squares is provided by the 
following fundamental identity. 


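THEOREM (Fundamental ANOVA Identity)

$$\mathrm{SSTr} + \mathrm{SSE} = \mathrm{SST}$$

where SST is the total sum of squares given by

$$\mathrm{SST} = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(X_{ij} - \bar{X}_{\cdot\cdot}\right)^2$$

The proof of the identity follows from squaring both sides of the relationship

$$x_{ij} - \bar{x}_{\cdot\cdot} = \left(x_{ij} - \bar{x}_{i\cdot}\right) + \left(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}\right) \tag{11.1}$$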


and summing over all i and j. This gives SST on the left and SSTr and SSE as the two extreme terms 
on the right; the cross-product term is easily seen to be zero (Exercise 13). 
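As a numerical illustration of the identity (not a proof), the following sketch evaluates all three sums of squares and the cross-product term for the Table 11.1 data. The variable names are illustrative.

```python
boxes = [
    [655.5, 788.3, 734.3, 721.4, 679.1, 699.4],
    [789.2, 772.5, 786.9, 686.1, 732.1, 774.8],
    [737.1, 639.0, 696.3, 671.7, 717.2, 727.1],
    [535.1, 628.7, 542.4, 559.0, 586.9, 520.0],
]
I, J = len(boxes), len(boxes[0])
row_means = [sum(r) / J for r in boxes]
grand = sum(sum(r) for r in boxes) / (I * J)

# The three sums of squares, each computed directly from its definition.
sstr = J * sum((m - grand) ** 2 for m in row_means)
sse = sum((x - m) ** 2 for r, m in zip(boxes, row_means) for x in r)
sst = sum((x - grand) ** 2 for r in boxes for x in r)

# Cross-product term obtained when both sides of (11.1) are squared and summed.
cross = 2 * sum((x - m) * (m - grand) for r, m in zip(boxes, row_means) for x in r)

print(f"SSTr + SSE = {sstr + sse:,.2f}")
print(f"SST        = {sst:,.2f}")
print(f"cross term = {cross:.2e}")
```

The cross term vanishes (up to floating-point rounding) because the deviations within each sample sum to zero.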

The interpretation of the fundamental identity is an important aid to understanding ANOVA. SST 
is a measure of the total variation in the data—the sum of all squared deviations about the grand 
mean. The identity says that this total variation can be partitioned into two pieces. SSTr is the amount 
of variation (between samples) that can be explained by possible differences in the $\mu_i$'s: when the $\mu_i$'s
differ substantially, the individual sample means should be further from the grand mean than when the
$\mu_i$'s are identical or close to one another. SSE measures variation that would be present (within
samples) even if $H_0$ were true; thus, SSE is the part of total variation that is unexplained by the truth or
falsity of $H_0$. If explained variation is large relative to unexplained variation, then $H_0$ should be
rejected in favor of $H_a$.

Formal inference will require the sampling distributions of both the statistics SSTr and SSE. Recall a
result from Section 6.4: if $X_1, \ldots, X_n$ is a random sample from a normal distribution, then the
sample mean $\bar{X}$ and the sample variance $S^2$ are independent. Also, $\bar{X}$ is normally distributed, and
$(n - 1)S^2/\sigma^2$ has a chi-squared distribution with $n - 1$ df. Similar results hold in our ANOVA situation.


THEOREM When the ANOVA assumptions are satisfied,


1. SSE and SSTr are independent random variables.
2. SSE/$\sigma^2$ has a chi-squared distribution with $IJ - I$ df.


Furthermore, when $H_0\colon \mu_1 = \cdots = \mu_I$ is true,


3. SSTr/$\sigma^2$ has a chi-squared distribution with $I - 1$ df.


Proof Independence of SSTr and SSE follows from the fact that SSTr is based on the individual
sample means whereas SSE is based on the sample variances, and $\bar{X}_{i\cdot}$ is independent of $S_i^2$ for each
$i$. Next, SSE/$\sigma^2$ can be expressed as the sum of chi-squared rvs:

$$\frac{\mathrm{SSE}}{\sigma^2} = \frac{(J - 1)S_1^2}{\sigma^2} + \cdots + \frac{(J - 1)S_I^2}{\sigma^2}$$

Each term in the sum has a $\chi^2_{J-1}$ distribution, and dfs add because the samples are independent. Thus
SSE/$\sigma^2$ also has a chi-squared distribution, with df = $(J - 1) + \cdots + (J - 1) = I(J - 1) = IJ - I$.

Now suppose $H_0$ is true and let $Y_i = \bar{X}_{i\cdot}$ for $i = 1, \ldots, I$. Then $Y_1, Y_2, \ldots, Y_I$ are independent and
normally distributed with the same mean and with variance $\sigma^2/J$. Thus, by the key result from
Section 6.4, $(I - 1)S_Y^2/(\sigma^2/J)$ has a chi-squared distribution with $I - 1$ df when $H_0$ is true.
Furthermore,

$$\frac{(I - 1)S_Y^2}{\sigma^2/J} = \frac{J(I - 1)}{\sigma^2}\cdot\frac{\sum_i \left(Y_i - \bar{Y}\right)^2}{I - 1} = \frac{J}{\sigma^2}\sum_i \left(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}\right)^2 = \frac{\mathrm{SSTr}}{\sigma^2}$$

so under $H_0$, SSTr/$\sigma^2 \sim \chi^2_{I-1}$.


SSTr and SSE provide measures of between- and within-samples variability, respectively, but they 
are not (yet) directly comparable. Analogous to the definition of sample variance in Chapter 1, 
wherein a sum of squares was divided by its degrees of freedom, we make the following definitions. 


DEFINITION The mean square for treatments MSTr and the mean square for error MSE
are

$$\mathrm{MSTr} = \frac{\mathrm{SSTr}}{I - 1} \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{IJ - I}$$
The word “mean” is again being used in the sense of average; a mean square is a sum of squares
divided by its associated degrees of freedom. The next proposition sets the stage for our ultimate 
ANOVA hypothesis test. 


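PROPOSITION When the ANOVA assumptions are satisfied,
$E(\mathrm{MSE}) = \sigma^2$; that is, MSE is an unbiased estimator of $\sigma^2$.
Moreover, when $H_0\colon \mu_1 = \cdots = \mu_I$ is true,
$E(\mathrm{MSTr}) = \sigma^2$; in this case, MSTr is also an unbiased estimator of $\sigma^2$.


Proof The expected value of a chi-squared variable with $\nu$ df is just $\nu$. Thus, from the previous
theorem,

$$E\left(\frac{\mathrm{SSE}}{\sigma^2}\right) = IJ - I \;\Rightarrow\; E(\mathrm{MSE}) = E\left(\frac{\mathrm{SSE}}{IJ - I}\right) = \sigma^2$$

and

$$H_0 \text{ true} \;\Rightarrow\; E\left(\frac{\mathrm{SSTr}}{\sigma^2}\right) = I - 1 \;\Rightarrow\; E(\mathrm{MSTr}) = E\left(\frac{\mathrm{SSTr}}{I - 1}\right) = \sigma^2$$


MSTr is unbiased for $\sigma^2$ when $H_0$ is true, but what about when $H_0$ is false? It can be shown
(Exercise 14) that in this case, $E(\mathrm{MSTr}) > \sigma^2$. This is because the $\bar{X}_{i\cdot}$'s tend to differ more from each
other, and therefore from the grand mean, when the $\mu_i$'s are not identical than when they are the same.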
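A small simulation can make these expected-value statements concrete. The sketch below is illustrative only: the settings ($I = 4$, $J = 6$, $\sigma = 1$, the particular means, the seed, and 10,000 replications) are assumptions chosen for the demonstration, and the function names are hypothetical. For the alternative used here, the standard result $E(\mathrm{MSTr}) = \sigma^2 + J\sum_i(\mu_i - \bar{\mu})^2/(I - 1)$ gives $1 + 6(3)/3 = 7$.

```python
import random

def mean_squares(samples):
    """Return (MSTr, MSE) for a balanced single-factor layout."""
    I, J = len(samples), len(samples[0])
    means = [sum(s) / J for s in samples]
    grand = sum(means) / I
    mstr = J * sum((m - grand) ** 2 for m in means) / (I - 1)
    mse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s) / (I * J - I)
    return mstr, mse

def average_mean_squares(mus, sigma, J, reps, rng):
    """Average MSTr and MSE over many simulated normal data sets."""
    total_tr = total_e = 0.0
    for _ in range(reps):
        samples = [[rng.gauss(mu, sigma) for _ in range(J)] for mu in mus]
        mstr, mse = mean_squares(samples)
        total_tr += mstr
        total_e += mse
    return total_tr / reps, total_e / reps

rng = random.Random(11)  # fixed seed so the run is repeatable
null_mstr, null_mse = average_mean_squares([0, 0, 0, 0], 1.0, 6, 10_000, rng)
alt_mstr, alt_mse = average_mean_squares([0, 0, 0, 2], 1.0, 6, 10_000, rng)

print(f"H0 true : average MSTr = {null_mstr:.3f}, average MSE = {null_mse:.3f}")
print(f"H0 false: average MSTr = {alt_mstr:.3f}, average MSE = {alt_mse:.3f}")
```

In both scenarios the average MSE should land near $\sigma^2 = 1$, while the average MSTr should be near 1 only when the means are equal.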


The F Test 

It follows from the preceding discussion that when $H_0$ is true the values of MSTr and MSE should be
close to each other. Equivalently, the ratio of these two quantities should be relatively near 1. On the
other hand, if $H_0$ is false then MSTr ought to exceed MSE, so their ratio will tend to exceed 1. This
suggests that a sensible test statistic for ANOVA is MSTr/MSE, but how large must this be to provide
convincing evidence against $H_0$? Answering this question requires knowing the sampling distribution
of this ratio.

In Section 6.3 we introduced a family of probability distributions called F distributions, which became
the basis for inference on the ratio of two variances in Section 10.5. If $Y_1$ and $Y_2$ are two independent
chi-squared random variables with $\nu_1$ and $\nu_2$ df, respectively, then the ratio $F = (Y_1/\nu_1)/(Y_2/\nu_2)$ has an
F distribution with $\nu_1$ numerator df and $\nu_2$ denominator df. Appendix Table A.8 gives F critical values for
$\alpha$ = .10, .05, .01, and .001. Values of $\nu_1$ are identified with different columns of the table and the rows are
labeled with various values of $\nu_2$. For example, the F critical value that captures upper-tail area .05 under
the F curve with $\nu_1 = 4$ and $\nu_2 = 6$ is $F_{.05,4,6} = 4.53$, whereas $F_{.05,6,4} = 6.16$ (so don’t accidentally switch
numerator and denominator df!). The key theoretical result that justifies the ANOVA test procedure is that
the test statistic MSTr/MSE has an F distribution when $H_0$ is true.
PROPOSITION When the ANOVA assumptions are satisfied and $H_0\colon \mu_1 = \cdots = \mu_I$ is true, the
test statistic $F = \mathrm{MSTr}/\mathrm{MSE}$ has an F distribution with $I - 1$ numerator df and
$IJ - I$ denominator df.


This result follows immediately from rewriting $F$ as

$$F = \frac{\mathrm{MSTr}}{\mathrm{MSE}} = \frac{\left(\mathrm{SSTr}/\sigma^2\right)/(I - 1)}{\left(\mathrm{SSE}/\sigma^2\right)/(IJ - I)}$$

and then applying the definition of the F distribution along with properties of SSTr and SSE
established earlier in this section. Here, finally, is the test procedure.


F TEST FOR SINGLE-FACTOR ANOVA

Null hypothesis: $H_0\colon \mu_1 = \cdots = \mu_I$
Alternative hypothesis: $H_a$: not all of the $\mu_i$'s are equal

Test statistic value: $f = \dfrac{\mathrm{MSTr}}{\mathrm{MSE}} = \dfrac{\mathrm{SSTr}/(I - 1)}{\mathrm{SSE}/(IJ - I)}$

Rejection region for level $\alpha$ test: $f > F_{\alpha, I-1, IJ-I}$
P-value calculation: area under the $F_{I-1, IJ-I}$ curve to the right of $f$


Refer to Section 10.5 to see how P-value information for F tests can be obtained from the table of 
F critical values. Alternatively, statistical software packages will automatically include the P-value 
with ANOVA output. 
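The claim that MSTr/MSE follows an F distribution under $H_0$ can also be checked empirically. The following sketch simulates balanced data with $I = 5$ samples of size $J = 4$ (so 4 and 15 df) and counts how often $f$ exceeds the tabled critical values $F_{.01,4,15} = 4.89$ and $F_{.10,4,15} = 2.36$. The common mean, standard deviation, seed, and number of replications are arbitrary assumptions, and `f_statistic` is a hypothetical helper name.

```python
import random

def f_statistic(samples):
    """ANOVA F ratio MSTr/MSE for a balanced single-factor layout."""
    I, J = len(samples), len(samples[0])
    means = [sum(s) / J for s in samples]
    grand = sum(means) / I
    mstr = J * sum((m - grand) ** 2 for m in means) / (I - 1)
    mse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s) / (I * J - I)
    return mstr / mse

rng = random.Random(2024)  # fixed seed so the run is repeatable
I, J, reps = 5, 4, 20_000

# Every sample comes from the same normal distribution, so H0 is true.
f_values = [
    f_statistic([[rng.gauss(50.0, 3.0) for _ in range(J)] for _ in range(I)])
    for _ in range(reps)
]

frac_01 = sum(f > 4.89 for f in f_values) / reps  # should be near .01
frac_10 = sum(f > 2.36 for f in f_values) / reps  # should be near .10

print(f"proportion of f values above 4.89: {frac_01:.4f}")
print(f"proportion of f values above 2.36: {frac_10:.4f}")
```

If the proposition holds, the two proportions should be close to the nominal tail areas .01 and .10, up to simulation error.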

The computations building up to the test statistic value f are often summarized in a tabular format, 
called an ANOVA table, as displayed in Table 11.2. Tables produced by statistical software 
customarily include a P-value column to the right of the f column. 


Table 11.2 An ANOVA table


Source of variation   df       Sum of squares   Mean square             f
Treatments            I − 1    SSTr             MSTr = SSTr/(I − 1)     MSTr/MSE
Error                 IJ − I   SSE              MSE = SSE/(IJ − I)

Total                 IJ − 1   SST


Example 11.3 With the ever-increasing power demand driven by everyone’s electronic devices, 
engineers have begun exploring ways to tap into the energy discharge from household items—a hot 
stove pan, a candle flame, or even a hot soup bowl. The article “Low-Power Energy Harvesting of 
Thermoelectric Battery Charger with Step-Up DC-DC Converter: Applicable Case Study for Personal
Electronic Gadgets” (J. Energy Engr. 2017) describes an experiment to compare the charging
characteristics of five thermoelectric modules under certain conditions. Consider the accompanying
data on the maximum power per unit area (mW/cm²) for $J = 4$ replications of the experiment on each
of the $I = 5$ modules.
Module   Max. power per cm²           $\bar{x}_{i\cdot}$   $s_i$

1        98.8   93.1   96.6   91.8    95.08    3.21
2        82.5   87.5   88.8   91.3    87.53    3.70
3        77.7   74.7   76.2   78.8    76.85    1.79
4        82.6   80.5   82.8   84.1    82.50    1.49
5        91.9   87.5   86.9   90.0    89.08    2.31


Let $\mu_i$ denote the true mean max power per unit area when module $i$ is used ($i$ = 1, 2, 3, 4, 5). The
null hypothesis $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$ states that the true average is identical for the five
modules. Let’s carry out a test at significance level .01 to see whether $H_0$ should be rejected in favor
of the assertion that true average max power per unit area is not the same for all modules. (At this
point the plausibility of the normality and equal variance assumptions should be checked; we defer
those tasks to later in this section.)

Since $I - 1 = 5 - 1 = 4$ and $IJ - I = 20 - 5 = 15$, the F critical value for the rejection region is
$F_{.01,4,15} = 4.89$. The grand mean for the 20 observations is $\bar{x}_{\cdot\cdot} = 86.20$ mW/cm². The treatment and
error sums of squares are

$$\mathrm{SSTr} = 4\left[(95.08 - 86.20)^2 + \cdots + (89.08 - 86.20)^2\right] = 759.6$$

$$\mathrm{SSE} = (4 - 1)\left[(3.21)^2 + \cdots + (2.31)^2\right] = 104.2$$

The remaining computations are summarized in the accompanying ANOVA table. Because $f = 27.33
> F_{.01,4,15} = 4.89$, $H_0$ is rejected at significance level .01. The P-value is the area under the $F_{4,15}$ curve
to the right of 27.33, which is 0 to several decimal places (and, in particular, far less than $\alpha = .01$). The
modules clearly do not yield the same mean maximum power per cm².


Source of variation df Sum of squares Mean square f P-value 
Treatments 4 759.6 189.90 27.33 <.0001 
Error 15 104.2 6.95 

Total 19 863.8 
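The entries of this ANOVA table can be reproduced from the raw data. The sketch below is a minimal standard-library implementation for balanced designs; `one_way_anova` is a hypothetical helper name, and the critical value 4.89 is taken from the text rather than computed (the standard library has no F distribution, so no P-value is produced here).

```python
def one_way_anova(samples):
    """ANOVA table entries for a balanced single-factor layout."""
    I, J = len(samples), len(samples[0])
    means = [sum(s) / J for s in samples]
    grand = sum(means) / I
    sstr = J * sum((m - grand) ** 2 for m in means)
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_tr, df_e = I - 1, I * J - I
    mstr, mse = sstr / df_tr, sse / df_e
    return {"df_tr": df_tr, "df_e": df_e, "SSTr": sstr, "SSE": sse,
            "MSTr": mstr, "MSE": mse, "f": mstr / mse}

# Maximum power per unit area (mW/cm^2) for the five modules of Example 11.3.
power = [
    [98.8, 93.1, 96.6, 91.8],
    [82.5, 87.5, 88.8, 91.3],
    [77.7, 74.7, 76.2, 78.8],
    [82.6, 80.5, 82.8, 84.1],
    [91.9, 87.5, 86.9, 90.0],
]
table = one_way_anova(power)

print(f"Treatments  df = {table['df_tr']:2d}  SS = {table['SSTr']:7.1f}  "
      f"MS = {table['MSTr']:6.2f}  f = {table['f']:.2f}")
print(f"Error       df = {table['df_e']:2d}  SS = {table['SSE']:7.1f}  "
      f"MS = {table['MSE']:6.2f}")
print(f"Total       df = {table['df_tr'] + table['df_e']:2d}  "
      f"SS = {table['SSTr'] + table['SSE']:7.1f}")

# Compare with the tabled critical value F_{.01,4,15} = 4.89 quoted in the text.
reject_h0 = table["f"] > 4.89
print("reject H0 at level .01:", reject_h0)
```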


When the F test causes $H_0$ to be rejected as in Example 11.3, researchers will naturally be interested
in further analysis to decide which $\mu_i$'s differ from which others. Methods for doing this are called
multiple comparison procedures and are described in the next two sections. 


Checking the ANOVA Assumptions 

In previous chapters, a normal probability plot was suggested for checking normality. The individual 
sample sizes in ANOVA are typically too small for $I$ separate plots to be informative. A single plot
can be constructed by first subtracting $\bar{x}_{1\cdot}$ from each observation in the first sample, $\bar{x}_{2\cdot}$ from each
observation in the second, and so on. These deviations are called residuals and are defined for
single-factor ANOVA by

$$\text{residual} = e_{ij} = x_{ij} - \bar{x}_{i\cdot}$$

There are a total of $IJ$ residuals, one for each observation. Table 11.3 shows the residuals for the 24
observations in Example 11.1. For instance, the first residual was computed as
$e_{11} = x_{11} - \bar{x}_{1\cdot} = 655.5 - 713.0 = -57.5$. Figure 11.2 displays a normal probability plot of these
residuals. The straightness of the pattern gives strong support to the normality assumption. An
analogous plot for the data of Example 11.3 conveys the same message.
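The coordinates of such a plot are easy to compute. The sketch below forms the 24 residuals for the Table 11.1 data, pairs the sorted residuals with standard normal percentiles, and reports their correlation as a rough numerical summary of straightness. The plotting positions $(k - .5)/n$ are an assumed convention (software packages differ), and the correlation summary is an addition for illustration, not part of the text's procedure.

```python
from statistics import NormalDist, mean

boxes = [
    [655.5, 788.3, 734.3, 721.4, 679.1, 699.4],
    [789.2, 772.5, 786.9, 686.1, 732.1, 774.8],
    [737.1, 639.0, 696.3, 671.7, 717.2, 727.1],
    [535.1, 628.7, 542.4, 559.0, 586.9, 520.0],
]

# Residual e_ij = x_ij minus the mean of the sample it belongs to.
residuals = [x - mean(s) for s in boxes for x in s]
first_residual = residuals[0]

# Normal probability plot coordinates: sorted residuals vs. z percentiles.
n = len(residuals)
ordered = sorted(residuals)
z = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu_u, mu_v = mean(u), mean(v)
    suv = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    suu = sum((a - mu_u) ** 2 for a in u)
    svv = sum((b - mu_v) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

r = correlation(z, ordered)
print(f"first residual = {first_residual:.2f}")
print(f"correlation between z percentiles and sorted residuals = {r:.3f}")
```

A correlation close to 1 corresponds to a nearly straight pattern in the plot.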


Table 11.3 Residuals for the data in Example 11.1


Type of box   Residual
1             −57.50    75.30    21.30     8.40   −33.90   −13.60
2              32.27    15.57    29.97   −70.83   −24.83    17.87
3              39.03   −59.07    −1.77   −26.37    19.13    29.03
4             −26.92    66.68   −19.62    −3.02    24.88   −42.02


Figure 11.2 A normal probability plot based on the data of Example 11.1 (vertical axis: residual; horizontal axis: z percentile)


The other ANOVA assumption is that the populations have equal variances. A popular informal 
rule is that if the largest sample standard deviation is not much more than twice the smallest one, it is
permissible to assume equal variances. This is especially true for balanced or nearly-balanced study 
designs. In Example 11.1, the largest s is only about 1.25 times the smallest. Example 11.3 violates 
this informal rule slightly—the ratio of the largest s and smallest s is 3.70/1.49 = 2.48—but, again, 
balance (i.e., equal sample sizes) makes this disparity somewhat less important. 

Several formal tests of equal variance have been devised. If the likelihood ratio principle is applied 
to the problem of testing for equal variances for normal data, then the result is Bartlett’s test. This is a 
generalization of the F test for equal variances given in Section 10.5, and it is very sensitive to the 
normality assumption. Since the ANOVA F test is robust in the presence of “mild” nonnormality (the 
significance level is approximately correct), it would be unfortunate to have the equal variances 
assumption invalidated not because they are different but because of such nonnormality. Levene’s 
test is much less sensitive to the assumption of normality. Essentially, this test involves performing an 
ANOVA on the absolute values of the residuals. That is, Levene’s test performs an ANOVA F test
using the absolute residuals $|e_{ij}|$ in place of $x_{ij}$. The idea is to use absolute residuals to compare the
variability of the samples.


Example 11.4 To apply Levene’s test to the data from Example 11.1, we first take the absolute 
values of the 24 residuals in Table 11.3. Then we apply ANOVA to the absolute residuals. With the 
aid of software, 


SSTr = 115.3      MSTr = 115.3/3 = 38.44       f = 38.44/486.44 = 0.08 
SSE = 9728.7      MSE = 9728.7/20 = 486.44 


Compare 0.08 to the critical value $F_{.10,3,20} = 2.38$. Because 0.08 is much smaller than 2.38, there is 
no reason to doubt that the population variances are equal. 

We were somewhat more concerned about the power data from Example 11.3, since the sample 
standard deviations were rather different. Computing the residuals, taking their absolute values, and 
applying ANOVA to the results give 


SSTr = 7.903      MSTr = 7.903/4 = 1.976       f = 1.976/1.684 = 1.17 
SSE = 25.265      MSE = 25.265/15 = 1.684 


Because $f = 1.17 < F_{.10,4,15} = 2.36$, we do not reject a null hypothesis of equal population variances at 
the .10 level (in fact, the P-value is .362). There was no need to worry. 


Given that the absolute residuals are certainly not normally distributed, it might seem questionable 
to subject them to ANOVA. Fortunately, Levene’s test works well even though this normality assumption is violated. 
A common sample size of 10 is sufficient for excellent accuracy in Levene’s test, but smaller samples 
can still give useful results when only approximate P-values are needed (i.e., when the Levene’s test 
P-value falls far above or far below the chosen significance level). 

Some software packages perform Levene’s test, but they will not necessarily get the same answer 
because they do not necessarily use absolute deviations from the mean. For example, Minitab uses 
absolute residuals with respect to the median, an especially good idea in case of skewed data. By 
default, SAS uses the squared deviations from the mean, although the absolute deviations from the 
mean can be requested. SAS also allows absolute deviations from the median (as the “BF” test, 
because Brown and Forsythe studied this procedure). 

The ANOVA F test is robust not only to mild departures from normality but also to mild 
departures from equal variances. When the sample sizes are all the same, as we are assuming so far, 
the test is especially insensitive to unequal variances. Also, there is a generalization of the two-sample 
t test of Section 10.2 for more than two samples, and it does not demand equal variances. This test is 
available in JMP, R, and SAS. 

If there is a major violation of assumptions, then the situation can sometimes be corrected by a data 
transformation, discussed in Section 11.3. Alternatively, the bootstrap can be used, by generalizing 
the method of Section 10.6 from two groups to several. There is also a nonparametric version of 
ANOVA (meaning no normality required) called the Kruskal-Wallis test, developed in Chapter 14. 

Exercises: Section 11.1 (1-14) 

1. An experiment to compare $I = 5$ brands of golf balls involved using a robotic driver to 
   hit $J = 7$ balls of each brand. The resulting between-sample and within-sample estimates 
   of $\sigma^2$ were MSTr = 123.50 and MSE = 22.16, respectively. 

   a. State and test the relevant hypotheses using a significance level of .05. 
   b. What can be said about the P-value of the test? 

2. The lumen output was determined for each of $I = 3$ different brands of 60-watt soft-white 
   lightbulbs, with $J = 8$ bulbs of each brand tested. The sums of squares were computed as 
   SSE = 4773.3 and SSTr = 591.2. State the hypotheses of interest (including word definitions 
   of parameters), and use the F test of ANOVA ($\alpha = .05$) to decide whether there are any 
   differences in true average lumen outputs among the three brands for this type of bulb by 
   obtaining as much information as possible about the P-value. 

3. Freezing and thawing out food can adversely affect its texture. In one experiment described 
   in the article “Effects of Freezing Treatments Before Convective Drying on Quality 
   Parameters” (J. of Food Engr. 2019: 15-24), apple pieces were frozen using a -20 °C freezer 
   (F20), a -80 °C freezer (F80), or liquid nitrogen (FLN). After being thawed out, texture tests 
   were performed on each piece. The following information summarizes the elastic modulus (kPa) 
   of the apple pieces. (Elastic modulus measures the apple pieces’ resistance to deformation 
   under load.) 

   Freezing method    J    $\bar{x}_{i\cdot}$    $s_i$ 
   F20                8    61    10 
   F80                8    73    20 
   FLN                8    49    10 

   Assuming conditions are met, use the ANOVA F test at level .05 to decide whether there are 
   any differences between true mean elastic modulus of apple pieces using the three freezing 
   methods. 


4. The article “Load-Carrying Capacity of Lengthwise Cracked Wood Beams Retrofitted by 
   Self-Tapping Screws” (J. Struct. Engr. 2017) provides data on the maximum load (kN) of 
   $J = 5$ specimens of $I = 4$ types of wood beams used in housing: intact beams (L1), long, 
   centered cracks (L2), short, centered cracks (L3), and long, off-center cracks (L4). 

   Beam type    Maximum load (kN) 
   L1           32.53   19.18    7.50   18.18   21.89 
   L2           13.15   10.62    6.75   16.08   14.12 
   L3           13.25   18.52   12.02   18.83   12.80 
   L4           26.95   13.19   11.55   24.63   23.63 

   Use a significance level of .05 to test the null hypothesis of no difference in true average 
   maximum load for these four beam types. 


5. The article “Differences in Impact Performance of Bicycle Helmets During Oblique Impacts” 
   (J. Biomech. Engr. 2018) describes an experiment in which 10 different bicycle helmet brands 
   (4 of each brand, for 40 total helmets) were strapped onto a mannequin head and subjected to 
   a frontal impact at 6.6 m/s. At that speed, concussion without a helmet is extremely likely. 
   The peak linear acceleration (PLA, in g) was measured in each test; values over 300 g are 
   associated with a high risk of brain injury. 

   a. The sample mean PLAs for the 10 helmet brands are presented below; labels are 
      abbreviations used in the article for the brand names. 

      BMIPS   BSF   BSP   CW    GMIPS 
      141     144   122   127   148 

      GS      NW    SOO   ST    SWE 
      175     147   117   142   129 

      Use these sample means to determine SSTr and MSTr. 

   b. The value SSE = 3900 is consistent with information provided in the article. Construct 
      an ANOVA table, and carry out a hypothesis test at the .01 significance level. 


6. In an experiment to investigate the performance of four different brands of spark plugs 
   intended for use on a 125-cc two-stroke motorcycle, five plugs of each brand were tested for 
   the number of miles (at a constant speed) until failure. The partial ANOVA table for the data 
   is given here. Fill in the missing entries, state the relevant hypotheses, and carry out a 
   test by obtaining as much information as you can about the P-value. 

   Source    df    Sum of squares    Mean square    f 
   Brand 
   Error                             14,713.69 
   Total           310,500.76 


7. Consider the box compression data presented in Example 11.1. Carry out an analysis of 
   variance F test at significance level .01, and summarize the results in an ANOVA table. 


8. Plastic waste, particularly microplastics in oceans and waterways, has become an increasing 
   global environmental concern. The article “Environmentally Relevant Concentrations of 
   Microplastic Particles Influence Larval Fish Ecology” (Science, 3 June 2016: 1213-1216) 
   describes an experiment in which fertilized egg strands of European perch were placed in 15 
   identical tanks. Tanks were then randomly assigned (1) no microplastics, (2) a “typical” 
   microplastic concentration (10,000 particles/m³), or (3) a high concentration 
   (80,000 particles/m³). After a three-week period, the successful hatching rates were recorded 
   for every tank; the data appears below. 

   Microplastic level    Hatching success rate 
   None                  95   98   96   92   97 
   Typical               86   93   88   87   91 
   High                  85   74   86   77   83 

   Does the data provide convincing statistical evidence that microplastic level has an effect 
   on the success rate of eggs hatching for European perch? Test at the .01 significance level. 


9. One popular approach to lower back pain is to apply a piece of durable tape along the lower 
   spine. In one study, 108 women with lower back pain were randomly assigned to receive one of 
   four treatments: Kinesio tape applied to a tense back (KTT), Kinesio tape without any back 
   tension (KTNT), Micropore tape (MP), and no tape (CG, a control group). After 10 days of 
   treatment, each woman’s lower back extension (degrees) was measured. The accompanying table 
   summarizes the results. 

   Treatment    J     $\bar{x}_{i\cdot}$    $s_i$ 
   KTT          27    30°    14° 
   KTNT         27    29°    15° 
   MP           27    26°    13° 
   CG           27    27°     9° 

   (“Kinesio Taping Reduces Pain and Improves Disability in Low Back Pain Patients,” 
   Physiotherapy 2019: 65-75.) 

   a. Calculate SSTr, MSTr, SSE, and MSE. 
   b. Test the null hypothesis that true mean back extension after 10 days is the same for all 
      four treatments, at the .05 level. 


10. Does music affect memorization skills, and does it matter if the music includes 
    vocals/lyrics? In one experiment (P. Ramos, “The Impact of Music on Short Term Memory and 
    Cognitive Processes,” Univ. of Bridgeport, 2017), subjects were randomly assigned to one of 
    four environments: (A) an instrumental-only version of Adele’s “Million Years Ago” playing 
    in the background, (B) a vocals-only version, (C) a version with both instruments and 
    vocals, and (D) a control group with no ambient music. All subjects were given a list of 
    everyday words to memorize in 90 seconds and then asked to write down as many words as they 
    could remember. Information consistent with that report appears below. 

                          Number of words remembered 
Condition             J     $\bar{x}_{i\cdot}$    $s_i$ 
Instrumental only 26 10.2 2.8 
Vocals only 26 Wd 2 
Instruments & vocals 26 9.0 2.5 
Control (no music) 26 10.7 3.0 


Test whether music environment affects 
memorization skills, as measured by pop- 
ulation mean number of words remembered 
using this activity, at the .01 significance 
level. 


11. The article referenced in Example 11.3 also looked at the time (min) to charge a 4.2-V 
    battery. 

    Module    Time to full charge (min) 
    1         200   199   204   208 
    2         233   229   226   224 
    3         140   146   146   136 
    4         169   174   171   166 
    5         205   212   214   208 

    a. Check the ANOVA assumptions with a normal plot and a test for equal variances. 
    b. Does mean charge time differ across these five thermoelectric modules? State and test 
       the relevant hypotheses using $\alpha = .01$. 


12. Six samples of each of four types of cereal grain grown in a certain region were analyzed 
    to determine thiamin content, resulting in the following data (μg/g): 

Grain     Thiamin content 


Wheat 3:2 4.5 6.0 6.1 6.7 5.8 
Barley 6.5 8.0 6.1 is) 5.9 5.6 
Maize 5.8 4.7 6.4 4.9 6.0 3.2 
Oats 8.3 6.1 7.8 7.0 5.5 7.2 


a. Check the ANOVA assumptions. 

b. Test to see if at least two of the grains 
differ with respect to true average thiamin 
content. Use an $\alpha = .05$ test based 
on the P-value method. 


13. Derive the fundamental identity SST = SSTr + SSE by squaring both sides of Equation (11.1) 
    and summing over all $i$ and $j$. (Hint: For any particular $i$, 
    $\sum_j (x_{ij} - \bar{x}_{i\cdot}) = 0$.) 

14. In single-factor ANOVA with $I$ treatments and $J$ observations per treatment, let 
    $\mu = (1/I)\sum \mu_i$. 

    a. Express $E(\bar{X}_{\cdot\cdot})$ in terms of $\mu$. [Hint: $\bar{X}_{\cdot\cdot} = (1/I)\sum \bar{X}_{i\cdot}$] 
    b. Express $E(\bar{X}_{i\cdot}^2)$ in terms of $\sigma^2$ and the $\mu_i$’s. [Hint: For any rv $Y$, 
       $E(Y^2) = V(Y) + [E(Y)]^2$.] 
    c. Express $E(\bar{X}_{\cdot\cdot}^2)$ in terms of $\mu$ and $\sigma^2$. 
    d. Express E(SSTr) in terms of $\mu$, $\sigma^2$, and the $\mu_i$’s. Then show that 

       $$E(\mathrm{MSTr}) = \sigma^2 + \frac{J}{I-1}\sum_i (\mu_i - \mu)^2$$ 

    e. Using the result of part (d), what is E(MSTr) when $H_0$ is true? When $H_0$ is false, how 
       does E(MSTr) compare to $\sigma^2$? 


11.2. Multiple Comparisons in ANOVA 

When the computed value of the F statistic in single-factor ANOVA is not significant, the analysis is 
terminated because no differences among the $\mu_i$’s have been identified. But when $H_0$ is rejected, the 
investigator will usually want to know which of the $\mu_i$’s are different from each other. A method for 
carrying out this further analysis is called a multiple comparisons procedure. 

Several of the most frequently used such procedures are based on the following central idea. First 
calculate a confidence interval for each pairwise difference $\mu_i - \mu_j$ with $i < j$. Thus if $I = 4$, the six 
required CIs would be for $\mu_1 - \mu_2$ (but not also for $\mu_2 - \mu_1$), $\mu_1 - \mu_3$, $\mu_1 - \mu_4$, $\mu_2 - \mu_3$, $\mu_2 - \mu_4$, and 
$\mu_3 - \mu_4$. Then if the interval for $\mu_1 - \mu_2$ does not include 0, conclude that $\mu_1$ and $\mu_2$ differ significantly 
from each other; if the interval includes 0, the two $\mu$’s are judged not significantly different. 
Following the same line of reasoning for each of the other intervals, we end up being able to judge for 
each pair of $\mu_i$’s whether or not they differ significantly from each other. 

The procedures based on this idea differ in the method used to calculate the various CIs. Here we 
present a popular method that controls the simultaneous confidence level for all $\binom{I}{2} = I(I-1)/2$ 
intervals calculated. 


Tukey’s Procedure 
Tukey’s procedure involves the use of another probability distribution. 


DEFINITION Let $Z_1, Z_2, \ldots, Z_m$ be $m$ independent standard normal rvs, and let $W$ be a chi-squared rv 
with $\nu$ df, independent of the $Z_i$’s. Then the distribution of 

$$Q = \frac{\max_{i,j} |Z_i - Z_j|}{\sqrt{W/\nu}} = \frac{\max(Z_1, \ldots, Z_m) - \min(Z_1, \ldots, Z_m)}{\sqrt{W/\nu}}$$ 

is called the studentized range distribution. The distribution has two parameters: 
$m$ = the number of $Z_i$’s and $\nu$ = denominator df. We denote the critical value that 
captures upper-tail area $\alpha$ under the density curve of $Q$ by $Q_{\alpha,m,\nu}$. A tabulation of 
these critical values appears in Appendix Table A.9. 


The word “range” reflects the fact that the numerator of $Q$ is indeed the range of the $Z_i$’s. Dividing the range 
by $\sqrt{W/\nu}$ is the same as dividing each individual $Z_i$ by $\sqrt{W/\nu}$. But $Z_i/\sqrt{W/\nu}$ has a (Student) t 
distribution¹; “studentizing” refers to the division by $\sqrt{W/\nu}$. So $Q$ is actually the range of $m$ variables 
that have the t distribution (but they are not independent because the denominator is the same for each 
one). 
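
The definition can be checked numerically by simulating $Q$ directly from its ingredients. The sketch below is a rough Monte Carlo illustration, not a substitute for Table A.9 or statistical software; the function names, the number of replications, and the seed are arbitrary choices, and the chi-squared variable is generated as a gamma variable with shape $\nu/2$ and scale 2.

```python
import random

def simulate_q(m, nu, rng):
    """Draw one value of Q = (max Z - min Z) / sqrt(W / nu)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]   # m independent standard normals
    w = rng.gammavariate(nu / 2.0, 2.0)           # chi-squared rv with nu df
    return (max(z) - min(z)) / (w / nu) ** 0.5

def estimate_upper_critical_value(alpha, m, nu, reps=20000, seed=1):
    """Estimate the value that captures upper-tail area alpha under the density of Q."""
    rng = random.Random(seed)
    draws = sorted(simulate_q(m, nu, rng) for _ in range(reps))
    return draws[int((1 - alpha) * reps)]

# With m = 5 and nu = 40 the estimate should land near the tabled 4.04,
# up to simulation error.
q_hat = estimate_upper_critical_value(0.05, 5, 40)
print(round(q_hat, 2))
```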

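The identification of the quantities in the definition of $Q$ with single-factor ANOVA is as follows: 

$$Z_i = \frac{\bar{X}_{i\cdot} - \mu_i}{\sigma/\sqrt{J}} \qquad m = I \qquad W = \frac{SSE}{\sigma^2} = \frac{(IJ - I)\,MSE}{\sigma^2} \qquad \nu = IJ - I = I(J-1)$$ 

Substituting into $Q$ gives 

$$Q = \frac{\max_{i,j}\left|\dfrac{\bar{X}_{i\cdot} - \mu_i}{\sigma/\sqrt{J}} - \dfrac{\bar{X}_{j\cdot} - \mu_j}{\sigma/\sqrt{J}}\right|}{\sqrt{\dfrac{(IJ - I)\,MSE}{\sigma^2}\Big/(IJ - I)}} = \frac{\max_{i,j}\left|\bar{X}_{i\cdot} - \bar{X}_{j\cdot} - (\mu_i - \mu_j)\right|}{\sqrt{MSE/J}}$$ 

¹ “Student” was the pseudonym used by the statistician Gosset, who derived the t distribution but published his work 
using the pseudonym “Student” because his employer, the Guinness Brewing Co., would not permit publication under 
his own name. 

In this latter expression for $Q$, the denominator $\sqrt{MSE/J}$ is the estimated standard deviation of 
$\bar{X}_{i\cdot} - \mu_i$. By definition of $Q$ and $Q_\alpha$, $P(Q \le Q_\alpha) = 1 - \alpha$, so 

$$\begin{aligned}
1 - \alpha &= P\left(\frac{\max_{i,j}\left|\bar{X}_{i\cdot} - \bar{X}_{j\cdot} - (\mu_i - \mu_j)\right|}{\sqrt{MSE/J}} \le Q_{\alpha,I,I(J-1)}\right) \\
&= P\left(\frac{\left|\bar{X}_{i\cdot} - \bar{X}_{j\cdot} - (\mu_i - \mu_j)\right|}{\sqrt{MSE/J}} \le Q_{\alpha,I,I(J-1)} \text{ for all } i, j\right) \\
&= P\left(-Q_\alpha\sqrt{MSE/J} \le \bar{X}_{i\cdot} - \bar{X}_{j\cdot} - (\mu_i - \mu_j) \le Q_\alpha\sqrt{MSE/J} \text{ for all } i, j\right) \\
&= P\left(\bar{X}_{i\cdot} - \bar{X}_{j\cdot} - Q_\alpha\sqrt{MSE/J} \le \mu_i - \mu_j \le \bar{X}_{i\cdot} - \bar{X}_{j\cdot} + Q_\alpha\sqrt{MSE/J} \text{ for all } i, j\right)
\end{aligned}$$ 

(whew!). Replacing $\bar{X}_{i\cdot}$, $\bar{X}_{j\cdot}$, and MSE by the values calculated from the data gives the following 
result. 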


PROPOSITION For each $i < j$, form the interval 

$$\bar{x}_{i\cdot} - \bar{x}_{j\cdot} \pm Q_{\alpha,I,I(J-1)}\sqrt{MSE/J} \qquad (11.2)$$ 

There are $I(I - 1)/2$ such intervals: one for $\mu_1 - \mu_2$, another for $\mu_1 - \mu_3$, ..., 
and the last for $\mu_{I-1} - \mu_I$. Then the simultaneous confidence level that every 
interval includes the corresponding value of $\mu_i - \mu_j$ is $100(1 - \alpha)\%$. 


Notice that the second subscript on $Q_\alpha$ is $I$, whereas the second subscript on $F_\alpha$ used in the ANOVA 
F test is $I - 1$. $Q_{\alpha,I,I(J-1)}$ can be obtained in R with the command qtukey(1 - α, I, I(J - 1)). 

We will say more about the interpretation of “simultaneous” shortly. Each interval that doesn’t 
include 0 yields the conclusion that the corresponding values of $\mu_i$ and $\mu_j$ are different; we say that $\mu_i$ 
and $\mu_j$ “differ significantly” from each other. 


Example 11.5 An experiment was carried out to compare five different brands of automobile oil 
filters with respect to their ability to capture foreign material. Let $\mu_i$ denote the true average amount of 
material captured by brand $i$ filters ($i = 1, \ldots, 5$) under controlled conditions. A sample of $J = 9$ filters 
of each brand was used, resulting in the following sample mean amounts: $\bar{x}_{1\cdot} = 14.5$, $\bar{x}_{2\cdot} = 13.8$, 
$\bar{x}_{3\cdot} = 13.3$, $\bar{x}_{4\cdot} = 14.3$, and $\bar{x}_{5\cdot} = 13.1$. We will assume for this example that the conditions for 
inference (approximate normality and equal variance) are met by the data set. Table 11.4 is the 
ANOVA table summarizing the first part of the analysis. 


Table 11.4 ANOVA table for Example 11.5 


Source of variation df Sum of squares Mean square f 
Treatments (brands) 4 13.32 3.33 37.84 
Error 40 3.53 .088 

Total 44 16.85 


Since $f = 37.84 > F_{.001,4,40} = 5.70$, $H_0$ is decisively rejected. 

We now use Tukey’s procedure to look for significant differences among the $\mu_i$’s. From Appendix 
Table A.9, $Q_{.05,5,40} = 4.04$ (the second subscript on $Q$ is 5 and not 5 - 1 as in $F$). Applying Equation 
(11.2), the CI for each $\mu_i - \mu_j$ is 

$$(\bar{x}_{i\cdot} - \bar{x}_{j\cdot}) \pm 4.04\sqrt{.088/9} = (\bar{x}_{i\cdot} - \bar{x}_{j\cdot}) \pm 0.4$$ 


Due to the balanced design, the margin of error is the same on all $\binom{5}{2} = 10$ Tukey CIs; if the sample 
sizes differed, this would not be the case (see Section 11.3). The resulting CIs are displayed in 
Table 11.5; those marked with an asterisk do not include zero: 


Table 11.5 Tukey’s simultaneous confidence intervals for Example 11.5 

i   j   CI for $\mu_i - \mu_j$        i   j   CI for $\mu_i - \mu_j$ 
1   2   (0.3, 1.1)*                   2   4   (-0.9, -0.1)* 
1   3   (0.8, 1.6)*                   2   5   (0.3, 1.1)* 
1   4   (-0.2, 0.6)                   3   4   (-1.4, -0.6)* 
1   5   (1.0, 1.8)*                   3   5   (-0.2, 0.6) 
2   3   (0.1, 0.9)*                   4   5   (0.8, 1.6)* 


Thus brands 1 and 4 are not significantly different from one another, but they are significantly 
higher than the other three brands in their true average amounts captured. Brand 2 is significantly better than 3 
and 5 but worse than 1 and 4, and brands 3 and 5 do not differ significantly. 
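
The ten intervals of Table 11.5 can be reproduced with a short calculation. The following Python sketch is illustrative only: the function and variable names are hypothetical, and the critical value 4.04 is simply taken from the text rather than computed.

```python
from itertools import combinations
from math import sqrt

def tukey_intervals(means, mse, j, q_crit):
    """Tukey CIs for mu_a - mu_b (a < b) in a balanced design with j observations per group."""
    margin = q_crit * sqrt(mse / j)   # same margin of error for every pair
    intervals = {}
    for a, b in combinations(sorted(means), 2):
        diff = means[a] - means[b]
        intervals[(a, b)] = (diff - margin, diff + margin)
    return intervals

# Sample means, MSE, J, and Q critical value from Example 11.5
brand_means = {1: 14.5, 2: 13.8, 3: 13.3, 4: 14.3, 5: 13.1}
intervals = tukey_intervals(brand_means, mse=0.088, j=9, q_crit=4.04)

# Pairs whose interval contains zero are judged not significantly different
not_significant = [pair for pair, (lo, hi) in intervals.items() if lo < 0 < hi]

for pair, (lo, hi) in intervals.items():
    print(pair, f"({lo:.1f}, {hi:.1f})")
print("Not significantly different:", not_significant)
```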


While the CIs in Table 11.5 correctly indicate which means are believed to be significantly 
different, this display is rather unwieldy. It is preferable to list out the observed sample means (say, 
from smallest to largest) and somehow indicate which are “close enough” that the corresponding 
population means are not judged to be significantly different. The following box describes how 
nonsignificant differences can be identified visually using an “underscoring pattern.” 


Tukey’s Method for Identifying Significantly Different $\mu_i$’s 

1. List the sample means in increasing order (make sure to identify the corresponding 
   population/treatment for each $\bar{x}_{i\cdot}$). 
2. Starting at the far left, use the Tukey intervals to determine which means are 
   not significantly different from the first one in the list. Underscore that set of 
   means with a single line segment. 
3. Continue in this fashion for the second mean, third mean, etc., always underscoring 
   in the rightward direction. Duplicate underscorings should only be drawn once. 

Any pair of sample means not underscored by the same line correspond to a 
pair of population or treatment means that are judged significantly different. 


In fact, it is not necessary to construct the Tukey intervals in order to perform step 2. Rather, the CI 
for $\mu_i - \mu_j$ will include 0 if and only if $\bar{x}_{i\cdot}$ and $\bar{x}_{j\cdot}$ differ by less than $Q_{\alpha,I,I(J-1)}\sqrt{MSE/J}$, the margin of 
error of the confidence interval (11.2). This margin of error is sometimes referred to as Tukey’s 
honestly significant difference (HSD). 

As an example, consider $I = 5$ with 

$$\bar{x}_{2\cdot} < \bar{x}_{5\cdot} < \bar{x}_{4\cdot} < \bar{x}_{1\cdot} < \bar{x}_{3\cdot}$$ 

Suppose the Tukey confidence intervals for $\mu_2 - \mu_5$ and $\mu_2 - \mu_4$ include zero, but those for $\mu_2 - \mu_1$ and 
$\mu_2 - \mu_3$ do not. Then we draw a line segment starting from 2 and extending to 4: 

Group:          2       5       4       1       3 
Sample mean:   x̄2.     x̄5.     x̄4.     x̄1.     x̄3. 
               -------------------

Next, suppose the mean for group 5 is not significantly different from that of group 4. Since we have 
already accounted for that set, no duplicate line is drawn. Finally, if groups 4 and 1 are not 
significantly different, that pair is underscored: 

Group:          2       5       4       1       3 
Sample mean:   x̄2.     x̄5.     x̄4.     x̄1.     x̄3. 
               -------------------
                               -----------

The fact that $\bar{x}_{3\cdot}$ isn’t underlined at all indicates that $\mu_3$ is statistically significantly different from all 
other group means. 


Example 11.6 (Example 11.5 continued) The five sample means, arranged in order, are 

Brand of filter:    5       3       2       4       1 
Sample mean:       13.1    13.3    13.8    14.3    14.5 

Only two of the Tukey CIs included zero: the interval for $\mu_1 - \mu_4$ and that for $\mu_3 - \mu_5$. Equivalently, 
only those two pairs of means differ by less than the honestly significant difference $4.04\sqrt{.088/9} = .4$. 
The resulting underscoring pattern is 

Brand of filter:    5       3       2       4       1 
Sample mean:       13.1    13.3    13.8    14.3    14.5 
                   ------------            ------------

The mean for brand 2 is not underlined at all, since $\mu_2$ was judged to be significantly different from all 
other means. 
If $\bar{x}_{2\cdot} = 13.6$ rather than 13.8 with HSD = .4, the underscoring configuration would be 

Brand of filter:    5       3       2       4       1 
Sample mean:       13.1    13.3    13.6    14.3    14.5 
                   ------------            ------------
                           ------------

The interpretation of this underscoring must be done with care, since we seem to have concluded that 
brands 5 and 3 do not differ significantly, 3 and 2 also do not, yet 5 and 2 do differ. One could say 
here that although evidence allows us to conclude that brands 5 and 2 differ from each other, neither 
has been shown to be significantly different from brand 3. 
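
The underscoring steps lend themselves to a small routine. The sketch below is one possible way to code them, under the assumption of a balanced design with a single HSD; the function name and the list-of-lists output format are hypothetical choices. Each returned list is one underscore segment, and a group that appears in no segment differs significantly from all others.

```python
def underscore_groups(means, hsd):
    """Return underscore segments (lists of group labels, ordered by sample mean)."""
    ordered = sorted(means, key=means.get)   # labels from smallest to largest mean
    segments = []
    furthest = -1                            # right end of the most recent segment
    for start, label in enumerate(ordered):
        end = start
        # Extend rightward while the next mean is within HSD of the current one
        while end + 1 < len(ordered) and means[ordered[end + 1]] - means[label] < hsd:
            end += 1
        # Skip single means and segments already covered by an earlier underscore
        if end > start and end > furthest:
            segments.append(ordered[start:end + 1])
            furthest = end
    return segments

# Example 11.6: HSD = .4
filters = {1: 14.5, 2: 13.8, 3: 13.3, 4: 14.3, 5: 13.1}
pattern = underscore_groups(filters, 0.4)
print(pattern)

# Same data with the mean for brand 2 changed to 13.6
filters_alt = dict(filters)
filters_alt[2] = 13.6
pattern_alt = underscore_groups(filters_alt, 0.4)
print(pattern_alt)
```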


Example 11.7 Almost anyone who attends physical therapy receives transcutaneous electrical nerve 
stimulation (TENS), possibly combined with a cold compress, to address pain. The article “Effect of 
Burst TENS and Conventional TENS Combined with Cryotherapy on Pressure Pain Threshold” 
(Physiotherapy 2015: 155-160) summarizes an experiment in which 112 healthy women were each 
randomly assigned to one of seven treatments listed in the accompanying table (so, J = 16 for each 
treatment). After the treatment, researchers measured each woman’s pain threshold, in kg of force 
applied to the top of the humerus, resulting in the accompanying data. 

                                       Pain threshold (kg of force) 
Treatment                              $\bar{x}_{i\cdot}$    $s_i$ 
(1) Control 2.8 0.7 
(2) Placebo TENS 2.3 0.9 
(3) Conventional TENS 3.2 1.0 
(4) Burst TENS 4.3 0.9 
(5) Cryotherapy 4.4 0.9 
(6) Cryotherapy + burst TENS 5.7 0.8 
(7) Cryotherapy + conventional TENS 3.0 0.7 


Let $\mu_i$ = the true mean pain threshold after the $i$th treatment ($i = 1, \ldots, 7$). We wish to test the null 
hypothesis $H_0$: $\mu_1 = \cdots = \mu_7$ against the alternative that not all $\mu_i$’s are equal. From the sample 
means and standard deviations, SSTr = 133.67 and SSE = 75.75, giving the ANOVA table in 
Table 11.6. 

Since $f = 30.88 > F_{.01,6,105} = 2.98$, $H_0$ is rejected at the .01 level; in fact, P-value ≈ 0. We 
conclude that the true mean pain threshold differs across the seven treatments. 


Table 11.6 ANOVA table for Example 11.7 


Source of variation df Sum of squares Mean square f 
Treatments 6 133.67 22.28 30.88 
Error 105 75.75 0.72 

Total 111 209.42 
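
The entries of Table 11.6 follow from the summary statistics alone, since for a balanced design SSTr = $J\sum(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2$ and SSE = $(J-1)\sum s_i^2$. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical and the inputs are the means and standard deviations listed in the example.

```python
def anova_from_summary(means, sds, j):
    """Balanced single-factor ANOVA quantities from sample means and standard deviations."""
    i = len(means)
    grand_mean = sum(means) / i                       # valid because sample sizes are equal
    sstr = j * sum((m - grand_mean) ** 2 for m in means)
    sse = (j - 1) * sum(s ** 2 for s in sds)
    mstr = sstr / (i - 1)
    mse = sse / (i * (j - 1))
    return sstr, sse, mstr, mse, mstr / mse

# Summary statistics from Example 11.7 (treatments 1 through 7, J = 16)
tens_means = [2.8, 2.3, 3.2, 4.3, 4.4, 5.7, 3.0]
tens_sds = [0.7, 0.9, 1.0, 0.9, 0.9, 0.8, 0.7]
sstr, sse, mstr, mse, f = anova_from_summary(tens_means, tens_sds, 16)
print(round(sstr, 2), round(sse, 2), round(mstr, 2), round(mse, 2), round(f, 2))
```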


Next, we apply Tukey’s method. There are $I = 7$ treatments and 105 df for error, so $Q_{.01,7,105} \approx 5.02$ 
(interpolating from Table A.9) and Tukey’s HSD = $5.02\sqrt{0.72/16} = 1.06$. Ordering the means 
and underscoring yields 

(2)    (1)    (7)    (3)    (4)    (5)    (6) 
2.3    2.8    3.0    3.2    4.3    4.4    5.7 
------------------------    ----------


Higher pain thresholds are considered better. In that respect, treatment 6 (cryotherapy plus burst 
TENS) is the clear winner, since the mean pain threshold under that treatment is highest and sig- 
nificantly different than all others. Treatments 4 and 5 (just burst TENS or just cryotherapy) are next- 
best and are not significantly different from each other. Finally, the other four treatments are not 
honestly significantly different—and, in particular, treatments 3 and 7 are comparable to the control 
and placebo groups. 

Many research journals and some statistical software packages express the underscoring scheme 
with letter groupings. Figure 11.3 shows Minitab output from this analysis. The A, B, C 
letter groupings correspond to the three nonoverlapping sets identified above: treatment 6, treatments 
4 and 5, and the rest. 

Grouping Information Using the Tukey Method and 99% Confidence 

Treatment    N     Mean    Grouping 
6           16    5.700    A 
5           16    4.400      B 
4           16    4.300      B 
3           16    3.200        C 
7           16    3.000        C 
1           16    2.800        C 
2           16    2.300        C 

Means that do not share a letter are significantly different. 

Figure 11.3 Tukey’s method using Minitab 


The Interpretation of $\alpha$ in Tukey’s Procedure 

We stated previously that the simultaneous confidence level is controlled by Tukey’s method. So 
what does “simultaneous” mean here? Consider calculating a 95% CI for a population mean $\mu$ based 
on a sample from that population and then a 95% CI for a population proportion $p$ based on another 
sample selected independently of the first one. Prior to obtaining data, the probability that the first 
interval will include $\mu$ is .95, and this is also the probability that the second interval will include $p$. 
Because the two samples are selected independently of each other, the probability that both intervals 
will include the values of the respective parameters is $(.95)(.95) = (.95)^2 = .9025$. Thus the 
simultaneous or joint confidence level for the two intervals is roughly 90%: if pairs of intervals are 
calculated over and over again from independent samples, in the long run 90.25% of the time the first 
interval will capture $\mu$ and the second will include $p$. Similarly, if three CIs are calculated based on 
independent samples, the simultaneous confidence level will be $100(.95)^3\% \approx 86\%$. Clearly, as the 
number of intervals increases, the simultaneous confidence level that all intervals capture their 
respective parameters will decrease. 

Now suppose that we want to maintain the simultaneous confidence level at 95%. Then for two 
independent samples, the individual confidence level for each would have to be $100\sqrt{.95}\% \approx 97.5\%$. 
The larger the number of intervals, the higher the individual confidence level would have to be to 
maintain the 95% simultaneous level. 
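
The arithmetic for independent intervals is easy to tabulate. The short Python sketch below (function names are hypothetical) computes the simultaneous level for $k$ independent intervals and, going the other way, the individual level needed to hit a target simultaneous level. It applies only to independent intervals, so it does not describe the Tukey intervals themselves.

```python
def simultaneous_level(individual_level, k):
    """Joint confidence level of k independent intervals, each at individual_level."""
    return individual_level ** k

def individual_level_needed(simultaneous, k):
    """Individual level required so that k independent intervals reach the joint level."""
    return simultaneous ** (1.0 / k)

for k in (2, 3, 10):
    print(k, round(simultaneous_level(0.95, k), 4), round(individual_level_needed(0.95, k), 4))
```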

The tricky thing about the Tukey intervals is that they are not based on independent samples: 
MSE appears in every one, and various intervals share the same $\bar{x}_{i\cdot}$’s (e.g., in the case $I = 4$, three 
different intervals all use $\bar{x}_{1\cdot}$). This implies that there is no straightforward probability argument for 
ascertaining the simultaneous confidence level from the individual confidence levels. Nevertheless, if 
$Q_{.05}$ is used, the simultaneous confidence level is controlled at 95%, whereas using $Q_{.01}$ gives a 
simultaneous 99% level. To obtain a 95% simultaneous level, the individual level for each interval 
must be considerably larger than 95%. Said in a slightly different way, to obtain a 5% experimentwise 
or family error rate, the individual or per-comparison error rate for each interval must be considerably 
smaller than .05. 


Confidence Intervals for Other Parametric Functions 

In some situations, a CI is desired for a function of the $\mu_i$’s more complicated than a difference 
$\mu_i - \mu_j$. Let $\theta = \sum c_i\mu_i$, where the $c_i$’s are constants. One such function is 
$\frac{1}{2}(\mu_1 + \mu_2) - \frac{1}{3}(\mu_3 + \mu_4 + \mu_5)$, which in the context of Example 11.5 measures the difference between the group 
consisting of the first two brands and that of the last three brands. Because the $X_{ij}$’s are normally 
distributed with $E(X_{ij}) = \mu_i$ and $V(X_{ij}) = \sigma^2$, the natural estimator $\hat{\theta} = \sum c_i\bar{X}_{i\cdot}$ is normally distributed, 
unbiased for $\theta$, and 

$$V(\hat{\theta}) = V\Big(\sum_i c_i\bar{X}_{i\cdot}\Big) = \sum_i c_i^2\,V(\bar{X}_{i\cdot}) = \frac{\sigma^2}{J}\sum_i c_i^2$$ 

Estimating $\sigma^2$ by MSE and forming $\hat{\sigma}_{\hat{\theta}}$ results in a t variable $(\hat{\theta} - \theta)/\hat{\sigma}_{\hat{\theta}}$, which can be manipulated to 
obtain the following $100(1 - \alpha)\%$ confidence interval for $\sum c_i\mu_i$: 

$$\sum c_i\bar{x}_{i\cdot} \pm t_{\alpha/2,I(J-1)}\sqrt{MSE \cdot \sum c_i^2/J} \qquad (11.3)$$ 


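Example 11.8 (Example 11.5 continued) The parametric function for comparing the first two (store) 
brands of oil filter with the last three (national) brands is $\theta = \frac{1}{2}(\mu_1 + \mu_2) - \frac{1}{3}(\mu_3 + \mu_4 + \mu_5)$, from 
which 

$$\sum c_i^2 = \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 + \left(-\tfrac{1}{3}\right)^2 + \left(-\tfrac{1}{3}\right)^2 + \left(-\tfrac{1}{3}\right)^2 = \tfrac{5}{6}$$ 

With $\hat{\theta} = \frac{1}{2}(\bar{x}_{1\cdot} + \bar{x}_{2\cdot}) - \frac{1}{3}(\bar{x}_{3\cdot} + \bar{x}_{4\cdot} + \bar{x}_{5\cdot}) = .583$ and MSE = .088, a 95% CI for $\theta$ is 

$$.583 \pm 2.021\sqrt{(.088)(5/6)/9} = .583 \pm .182 = (.401, .765)$$ 

Notice that in the foregoing example the coefficients $c_1, \ldots, c_5$ satisfy $\sum c_i = \frac{1}{2} + \frac{1}{2} - \frac{1}{3} - \frac{1}{3} - \frac{1}{3} = 0$. 
When the coefficients sum to 0, the linear combination $\theta = \sum c_i\mu_i$ is called a contrast among the 
means, and the analysis is available in a number of statistical software programs. 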
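
Interval (11.3) is straightforward to evaluate by hand or in code. The following Python sketch reproduces the numbers of Example 11.8; the function name is hypothetical, and the t critical value 2.021 (40 df) is taken from the example rather than computed.

```python
from math import sqrt

def contrast_ci(coeffs, means, mse, j, t_crit):
    """Point estimate and margin of error for sum(c_i * mu_i) in a balanced design."""
    estimate = sum(c * m for c, m in zip(coeffs, means))
    margin = t_crit * sqrt(mse * sum(c * c for c in coeffs) / j)
    return estimate, margin

# Example 11.8: first two brands versus last three brands
coeffs = [1 / 2, 1 / 2, -1 / 3, -1 / 3, -1 / 3]
means = [14.5, 13.8, 13.3, 14.3, 13.1]
estimate, margin = contrast_ci(coeffs, means, mse=0.088, j=9, t_crit=2.021)

print(f"{estimate:.3f} +/- {margin:.3f}")
print(f"({estimate - margin:.3f}, {estimate + margin:.3f})")
```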

Sometimes an experiment is carried out to compare each of several “new” treatments to a control 
treatment. In such situations, a multiple comparisons technique called Dunnett’s method is appro- 


priate. 


Exercises: Section 11.2 (15-26) 

15. An experiment to compare the spreading rates of five different brands of yellow interior 
    latex paint available in a particular area used 4 gallons ($J = 4$) of each paint. The sample 
    average spreading rates (ft²/gal) for the five brands were $\bar{x}_{1\cdot} = 462.0$, $\bar{x}_{2\cdot} = 512.8$, 
    $\bar{x}_{3\cdot} = 437.5$, $\bar{x}_{4\cdot} = 469.3$, and $\bar{x}_{5\cdot} = 532.1$. The computed value of F was found to be 
    significant at level $\alpha = .05$. With MSE = 272.8, use Tukey’s procedure to investigate 
    significant differences in the true average spreading rates between brands. 

16. In the previous exercise, suppose $\bar{x}_{3\cdot} = 427.5$. Now which true average spreading rates 
    differ significantly from each other? Be sure to use the method of underscoring to 
    illustrate your conclusions, and write a paragraph summarizing your results. 

17. Repeat the previous exercise supposing that $\bar{x}_{2\cdot} = 502.8$ in addition to $\bar{x}_{3\cdot} = 427.5$. 

18. Consider the data on maximum load for cracked wood beams presented in Exercise 4. Would it 
    make sense to apply Tukey’s method to this data? Why or why not? [Hint: The P-value from the 
    analysis of variance is .169.] 

19. Use Tukey’s procedure on the data of Example 11.1 to identify differences in true average 
    compression strength among the four box types. Is your answer consistent with the boxplot 
    in Figure 11.1a? 

20. Use Tukey’s procedure on the data of Example 11.3 to identify differences in true mean 
    maximum power per unit area among the five modules. 

21. Of the five modules in Example 11.3, the first two are existing commercial devices and the 
    last three are prototypes constructed by the study’s authors. Compute a 95% t CI for the 
    contrast $\theta = \frac{1}{2}(\mu_1 + \mu_2) - \frac{1}{3}(\mu_3 + \mu_4 + \mu_5)$. 

22. The article “Iron and Manganese Present in Underground Water Promote Biochemical, 
    Genotoxic, and Behavioral Alterations in Zebrafish” (Environ. Sci. Pollut. Res. 2019: 
    23555-23570) reports the following data on micronucleus frequency in the muscle tissues for 
    zebrafish exposed to varying concentrations of iron and manganese. Micronuclei are a warning 
    sign of possible DNA damage. There were $J = 10$ zebrafish in each group. 

    Treatment            $\bar{x}_{i\cdot}$     $s_i$ 
    1. Control            53.72     4.30 
    2. Fe 0.8            139.11     6.41 
    3. Fe 1.3             93.62     4.49 
    4. Mn 0.2            134.66    13.20 
    5. Mn 0.4            141.12     8.32 
    6. Fe 0.8/Mn 0.2     101.25    15.41 
    7. Fe 1.3/Mn 0.4     124.50     9.60 

    a. Perform an analysis of variance at the .01 significance level. [Note: Though unequal 
       variances are a concern here, the balanced study design should at least partially 
       mitigate that issue.] 
    b. Apply Tukey’s method to determine which treatments result in significantly different 
       mean micronucleus frequencies. 
    c. Suppose that $100(1 - \alpha)\%$ CIs for $k$ different parametric functions are computed from 
       the same ANOVA data set. Then it is easily verified that the simultaneous confidence 
       level is at least $100(1 - k\alpha)\%$. Compute CIs with simultaneous confidence level at 
       least 98% for the contrasts $\mu_1 - \frac{1}{6}(\mu_2 + \mu_3 + \mu_4 + \mu_5 + \mu_6 + \mu_7)$ 
       and $\frac{1}{4}(\mu_2 + \mu_3 + \mu_6 + \mu_7) - \frac{1}{2}(\mu_4 + \mu_5)$. 


23. Refer back to the bike helmet data of Exercise 5. 

    a. Apply Tukey’s procedure at the .05 level to determine which bike helmets have honestly 
       significantly different mean peak linear acceleration under the specified experimental 
       conditions. Use SSE = 3900. 
    b. Seven of the 10 brands are considered road helmets (elongated shape with aerodynamic 
       venting), while three brands, BMIPS (1), GMIPS (5), and NW (7), are nonroad helmets. 
       Compute a 95% CI for the contrast $\theta = \frac{1}{7}(\mu_2 + \cdots + \mu_{10}) - \frac{1}{3}(\mu_1 + \mu_5 + \mu_7)$, 
       where the first sum spans across the seven road helmet brands. 


Consider the accompanying data on plant 
growth after the application of different 
types of growth hormone. 


Hormone Growth 
1 13 17 7 14 
2 21 13 20 17 
3 18 15 20 17 
4 7 11 18 10 
5 6 11 15 8 


a. Perform an F test at level « = .05. 
b. What happens when Tukey’s procedure 
is applied? 


Consider a single-factor ANOVA experi- 
ment in which J=3, J =5, X,. = 10, 
X. = 12, and x3. = 20. Determine a value 
of SSE for which f> F512, so that 
Ho: Ly = Uo = My is rejected, yet when 
Tukey’s procedure is applied none of the 
Ls differ significantly from each other. 


Refer to the previous exercise and suppose 
X1. = 10, x2. = 15, and x3. = 20. Can you 
now find a value of SSE that produces such 
a contradiction between the F test and 
Tukey’s procedure? 
11.3 More on Single-Factor ANOVA


In this section, we consider some additional issues relating to single-factor ANOVA. These include an
alternative description of the model parameters, power and β for the F test, the relationship of the test
to procedures previously considered, data transformation, a random effects model, and formulas for
the case of unequal sample sizes.


An Alternative Description of the ANOVA Model 
The assumptions of single-factor ANOVA can be described succinctly through the model equation

$$X_{ij} = \mu_i + \varepsilon_{ij}$$

where $\varepsilon_{ij}$ represents a random deviation from the population or true treatment mean
$\mu_i$. The $\varepsilon_{ij}$’s are assumed to be independent, normally distributed rvs (implying that
the $X_{ij}$’s are also) with $E(\varepsilon_{ij}) = 0$ [so that $E(X_{ij}) = \mu_i$] and
$V(\varepsilon_{ij}) = \sigma^2$ [from which $V(X_{ij}) = \sigma^2$ for every $i$ and $j$]. An alternative
description of single-factor ANOVA will give added insight and suggest appropriate generalizations
to models involving more than one factor. Define a parameter $\mu$ by

$$\mu = \frac{1}{I}\sum_{i=1}^{I} \mu_i$$

and parameters $\alpha_1, \ldots, \alpha_I$ by

$$\alpha_i = \mu_i - \mu \qquad (i = 1, \ldots, I)$$

Then the treatment mean $\mu_i$ can be written as $\mu + \alpha_i$, where $\mu$ represents the true
average overall response across all populations/treatments, and $\alpha_i$ is the effect, measured as a
departure from $\mu$, due to the $i$th treatment. Whereas we initially had $I$ parameters, we now have
$I + 1$: $\mu, \alpha_1, \ldots, \alpha_I$. However, because $\sum \alpha_i = 0$ (the average departure
from the overall mean response is zero), only $I$ of these new parameters are independently determined,
so there are as many independent parameters as there were before. In terms of $\mu$ and the
$\alpha_i$’s, the model equation becomes

$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij} \qquad (i = 1, \ldots, I;\ j = 1, \ldots, J)$$


In the next two sections, we will develop analogous models for two-factor ANOVA. The claim that
the $\mu_i$’s are identical is equivalent to the equality of the $\alpha_i$’s, and because
$\sum \alpha_i = 0$, the null hypothesis becomes

$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_I = 0$$

In Section 11.1, it was stated that MSTr is an unbiased estimator of $\sigma^2$ when $H_0$ is true but
otherwise tends to overestimate $\sigma^2$. More precisely, Exercise 14 established that

$$E(\mathrm{MSTr}) = \sigma^2 + \frac{J}{I-1}\sum_{i=1}^{I} (\mu_i - \mu)^2 = \sigma^2 + \frac{J}{I-1}\sum_{i=1}^{I} \alpha_i^2$$

When $H_0$ is true, $\sum \alpha_i^2 = 0$ so $E(\mathrm{MSTr}) = \sigma^2$ (MSE is unbiased whether or not
$H_0$ is true). If $\sum \alpha_i^2$ is used as a measure of the extent to which $H_0$ is false, then a
larger value of $\sum \alpha_i^2$ will result in a greater tendency for MSTr to overestimate $\sigma^2$.
More generally, formulas for expected mean squares for multifactor models are used to suggest how to
form F ratios to test various hypotheses.


Power and β for the F Test

Consider a set of parameter values $\alpha_1, \alpha_2, \ldots, \alpha_I$ for which $H_0$ is not true.
The power of the ANOVA F test is the probability that $H_0$ is rejected when that set is the set of true
values, and the probability of a type II error is $\beta = 1 - \text{power}$. One might think that power
and $\beta$ would have to be determined separately for each different configuration of $\alpha_i$’s.
Fortunately, power for the F test depends on the $\alpha_i$’s only through $\sum \alpha_i^2$, and so it
can be simultaneously evaluated for many different alternatives. For example, $\sum \alpha_i^2 = 4$ for
each of the following sets of $\alpha_i$’s, so power is identical for all three alternatives:

1. $\alpha_1 = -1$, $\alpha_2 = -1$, $\alpha_3 = 1$, $\alpha_4 = 1$

2. $\alpha_1 = -\sqrt{2}$, $\alpha_2 = \sqrt{2}$, $\alpha_3 = 0$, $\alpha_4 = 0$

3. $\alpha_1 = -\sqrt{3}$, $\alpha_2 = \sqrt{1/3}$, $\alpha_3 = \sqrt{1/3}$, $\alpha_4 = \sqrt{1/3}$
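The claim about the three alternatives is easy to confirm numerically. The short Python sketch below is an illustration only; the names `alternatives` and `summarize` are hypothetical and not part of any package. It checks that each set of effects sums to zero and has sum of squares equal to 4.

```python
import math

# The three sets of treatment effects listed above (I = 4 treatments).
alternatives = {
    1: [-1.0, -1.0, 1.0, 1.0],
    2: [-math.sqrt(2), math.sqrt(2), 0.0, 0.0],
    3: [-math.sqrt(3), math.sqrt(1 / 3), math.sqrt(1 / 3), math.sqrt(1 / 3)],
}


def summarize(alphas):
    """Return (sum of effects, sum of squared effects) for one alternative."""
    return sum(alphas), sum(a * a for a in alphas)


for label, alphas in alternatives.items():
    total, sum_sq = summarize(alphas)
    print(f"set {label}: sum = {total:.6f}, sum of squares = {sum_sq:.6f}")
```

Because all three sets share the same sum of squares, they lead to the same noncentrality parameter and hence the same power.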


When $H_0$ is false, the test statistic MSTr/MSE has a noncentral F distribution, a three-parameter
family. For one-way ANOVA, the first two parameters are still $\nu_1$ = numerator df = $I - 1$ and
$\nu_2$ = denominator df = $IJ - I$, while the noncentrality parameter $\lambda$ is given by

$$\lambda = \frac{J}{\sigma^2}\sum_{i=1}^{I} \alpha_i^2$$

Power is an increasing function of the noncentrality parameter $\lambda$ (and $\beta$ is a decreasing
function of $\lambda$). Thus, for fixed values of $\sigma^2$ and $J$, the null hypothesis is more likely
to be rejected for alternatives far from $H_0$ (large $\sum \alpha_i^2$) than for alternatives close to
$H_0$. For a fixed value of $\sum \alpha_i^2$, power increases and $\beta$ decreases as the sample size
$J$ on each treatment increases, whereas power decreases and $\beta$ increases as the error variance
$\sigma^2$ increases (since greater underlying variability makes it more difficult to detect any given
departure from $H_0$).

Because hand computation of power, $\beta$, and sample size for the F test is quite difficult (as in
the case of t tests), software is typically required. For one-way ANOVA, and with the noncentral
F parameters specified as above,

$$\beta = P(H_0 \text{ is not rejected when } H_0 \text{ is false})
= P(F < F_{\alpha,I-1,IJ-I} \text{ when } F \sim \text{noncentral } F)
= \text{noncentral cdf evaluated at } F_{\alpha,I-1,IJ-I}$$

and power = $1 - \beta$. Many statistical packages (including SAS, JMP, and R) have a function that
calculates the cumulative area under a noncentral F curve (required inputs are the critical value
$F_\alpha$, the numerator df, the denominator df, and $\lambda$), and this area is $\beta$.


Example 11.9 The effects of four different heat treatments on yield point (tons/in²) of steel ingots
are to be investigated. A total of $J = 8$ ingots will be cast using each treatment. Suppose the true
standard deviation of yield point for any of the four treatments is $\sigma = 1$. How likely is it that
$H_0$ will not be rejected at level .05 if three of the treatments have the same expected yield point
and the other treatment has an expected yield point that is 1 ton/in² greater than the common value of
the other three (i.e., the fourth yield is on average 1 standard deviation above those for the first
three treatments)?

Suppose that $\mu_1 = \mu_2 = \mu_3$ and $\mu_4 = \mu_1 + 1$, so $\mu = (\sum \mu_i)/4 = \mu_1 + \frac{1}{4}$.
Then $\alpha_1 = \mu_1 - \mu = -\frac{1}{4}$, $\alpha_2 = -\frac{1}{4}$, $\alpha_3 = -\frac{1}{4}$,
$\alpha_4 = \frac{3}{4}$, so

$$\lambda = \frac{8}{1^2}\left[\left(-\frac{1}{4}\right)^2 + \left(-\frac{1}{4}\right)^2 + \left(-\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = 6$$

The degrees of freedom are $\nu_1 = I - 1 = 3$ and $\nu_2 = IJ - I = 28$, and the F critical value for
the .05 test is $F_{.05,3,28} = 2.947$. The probability of a type II error here is $\beta \approx .54$,
obtained through the R command pf(2.947, df1=3, df2=28, ncp=6), and power = .46. This power is rather
low, so we might decide to increase the value of $J$. How many ingots of each type would be required to
yield $\beta = .05$ (about 95% power) for the alternative under consideration? By trying different
values of $J$, it can be verified that $J = 24$ will meet the requirement, but any smaller $J$ will
not. ■
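For readers without access to R, the calculation in Example 11.9 can be sketched in plain Python. The code below is an illustrative stand-in for a library routine, not the method used by any particular package: it evaluates the noncentral F cdf as a Poisson-weighted sum of regularized incomplete beta functions, with the incomplete beta computed by a standard continued fraction. All function names (`_betacf`, `reg_inc_beta`, `noncentral_f_cdf`) are hypothetical, and the critical value 2.947 is taken from the example rather than computed.

```python
from math import exp, lgamma, log


def _betacf(a, b, x, max_iter=300, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step, then odd step, of the continued-fraction recurrence.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def noncentral_f_cdf(f, df1, df2, ncp, terms=200):
    """P(F <= f) for a noncentral F rv with noncentrality parameter ncp."""
    x = df1 * f / (df1 * f + df2)
    weight = exp(-ncp / 2.0)  # Poisson(ncp/2) probability of k = 0
    total = 0.0
    for k in range(terms):
        total += weight * reg_inc_beta(df1 / 2.0 + k, df2 / 2.0, x)
        weight *= (ncp / 2.0) / (k + 1)
        if weight < 1e-17 and k > ncp / 2.0:
            break
    return total


# Setting of Example 11.9: I = 4 treatments, J = 8 ingots each, sigma = 1.
I, J, sigma = 4, 8, 1.0
alphas = [-0.25, -0.25, -0.25, 0.75]
ncp = J * sum(a * a for a in alphas) / sigma ** 2
f_crit = 2.947  # F_{.05,3,28}, as given in the example
beta = noncentral_f_cdf(f_crit, I - 1, I * (J - 1), ncp)
print(f"lambda = {ncp:.1f}, beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Setting `ncp` to 0 reduces the function to the ordinary F cdf, which gives a simple sanity check: the value at the critical point should be close to .95.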


In lieu of directly accessing the noncentral F cdf, some software packages will calculate power
when the user specifies all the necessary information. For example, R has a function that allows
specification of all $I$ of the means, along with any three among $J$, $\sigma^2$, $\alpha$, and power.
The function calculates whichever quantity is unspecified. For example, we might wish to calculate the
power of the test with $\alpha = .05$, $\sigma = 1$, $I = 4$, $J = 2$, $\mu_1 = 100$, $\mu_2 = 101$,
$\mu_3 = 102$, and $\mu_4 = 106$. The R function calculates power = .904 (and so $\beta = .096$).

Minitab v.19 does something rather different. The user is asked to specify the maximum difference
between $\mu_i$’s rather than the individual means. For example, in the previous scenario the maximum
difference is 106 − 100 = 6. However, power depends not only on this maximum difference but on the
values of all the $\mu_i$’s. In this situation Minitab calculates the smallest possible value of power
subject to $\mu_1 = 100$ and $\mu_4 = 106$, which occurs when the two other $\mu_i$’s are both halfway
between 100 and 106. This power is .86, so we can say that the power is at least .86 and $\beta$ is at
most .14 when the two most extreme $\mu_i$’s are separated by 6. The software will also determine the
necessary common sample size if maximum difference and minimum power are specified.


Relationship of the F Test to the t Test
When the number of populations is just $I = 2$, the ANOVA F test is testing $H_0$: $\mu_1 = \mu_2$
versus $H_a$: $\mu_1 \neq \mu_2$. In this case, a two-tailed, two-sample t test could also be used. In
Section 10.2, we mentioned the pooled t test, which requires equal variances, as an alternative to the
two-sample t procedure. With a little algebra, it can be shown that the single-factor ANOVA F test and
the two-tailed pooled t test are equivalent: for any given data set, the P-values for the two tests
will be identical, so the same conclusion will be reached by either test. (The test statistic values
are related by $f = t^2$.)

The two-sample t test is more flexible than the F test when $I = 2$ for two reasons. First, it is not
based on the assumption that $\sigma_1 = \sigma_2$; second, it can be used to test $H_a$:
$\mu_1 > \mu_2$ (an upper-tailed t test) or $H_a$: $\mu_1 < \mu_2$ (a lower-tailed test) as well as
$H_a$: $\mu_1 \neq \mu_2$.
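The identity between the squared pooled t statistic and the ANOVA f statistic can be seen numerically. The sketch below is illustrative only: the helper names are hypothetical, and the two samples are the permanent molding and die casting observations from Example 11.10, used here simply as convenient data.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pooled_t(x, y):
    """Two-sample pooled t statistic (equal-variance assumption)."""
    m, n = len(x), len(y)
    ss_x = sum((v - mean(x)) ** 2 for v in x)
    ss_y = sum((v - mean(y)) ** 2 for v in y)
    sp2 = (ss_x + ss_y) / (m + n - 2)  # pooled variance estimate
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / m + 1 / n))


def anova_f(groups):
    """Single-factor ANOVA f = MSTr/MSE for a list of samples."""
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    sstr = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (sstr / (len(groups) - 1)) / (sse / (n - len(groups)))


permanent = [45.5, 45.3, 45.4, 44.4, 44.6, 43.9, 44.6, 44.0]
die = [44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1]

t = pooled_t(permanent, die)
f = anova_f([permanent, die])
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, f = {f:.4f}")
```

The two printed quantities $t^2$ and $f$ agree up to floating-point rounding, which is the algebraic equivalence described above.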


Single-Factor ANOVA When Sample Sizes Are Unequal 

When the sample sizes from each population or treatment are not equal (i.e., an unbalanced study
design), let $J_1, J_2, \ldots, J_I$ denote the $I$ sample sizes and let $n = \sum_i J_i$ denote the
total number of observations. The accompanying box gives ANOVA formulas and the test procedure.




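Grand mean: $\bar{X}_{\cdot\cdot} = \frac{1}{n}\sum_{i=1}^{I}\sum_{j=1}^{J_i} X_{ij} = \frac{1}{n}\sum_{i=1}^{I} J_i \bar{X}_{i\cdot}$ (a weighted average of the sample means)
Fundamental ANOVA Identity: SST = SSTr + SSE
where the three sums of squares and associated dfs are

$$\mathrm{SSTr} = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = \sum_{i=1}^{I} J_i(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 \qquad \mathrm{df} = I - 1$$

$$\mathrm{SSE} = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{i\cdot})^2 = \sum_{i=1}^{I} (J_i - 1)S_i^2 \qquad \mathrm{df} = \sum_{i=1}^{I} (J_i - 1) = n - I$$

$$\mathrm{SST} = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{\cdot\cdot})^2 = \mathrm{SSTr} + \mathrm{SSE} \qquad \mathrm{df} = n - 1$$

Test statistic value:

$$f = \frac{\mathrm{MSTr}}{\mathrm{MSE}} \quad \text{where } \mathrm{MSTr} = \frac{\mathrm{SSTr}}{I - 1} \text{ and } \mathrm{MSE} = \frac{\mathrm{SSE}}{n - I}$$

Rejection region: $f \geq F_{\alpha,I-1,n-I}$
P-value: area under the $F_{I-1,n-I}$ curve to the right of $f$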


Verification of the fundamental ANOVA identity proceeds as in the case of equal sample sizes. 
However, it is somewhat trickier here to show that MSTr/MSE has the F distribution under Ho. 
Validity of the test procedure requires assuming, as before, that the population distributions are all 
normal with the same variance. The methods described at the end of Section 11.1 for assessing these
with the residuals $e_{ij} = x_{ij} - \bar{x}_{i\cdot}$ can also be applied here.


Example 11.10 The article “On the Development of a New Approach for the Determination of 
Yield Strength in Mg-Based Alloys” (Light Metal Age, Oct. 1998: 51-53) presented the following 
data on elastic modulus (GPa) obtained by a new ultrasonic method for specimens of an alloy 
produced using three different casting processes. 


    Process              Observations                                   $J_i$   $\bar{x}_{i\cdot}$   $s_i$
    Permanent molding    45.5  45.3  45.4  44.4  44.6  43.9  44.6  44.0    8    44.71    .624
    Die casting          44.2  43.9  44.7  44.2  44.0  43.8  44.6  43.1    8    44.06    .501
    Plaster molding      46.0  45.9  44.8  46.2  45.1  45.5                6    45.58    .549
                                                                          22

Let $\mu_1$, $\mu_2$, and $\mu_3$ denote the true average elastic moduli for the three different
processes under the given circumstances. The relevant hypotheses are $H_0$: $\mu_1 = \mu_2 = \mu_3$
versus $H_a$: at least two of the $\mu_i$’s are different. The test statistic is, of course,
F = MSTr/MSE, based on $I - 1 = 2$ numerator df and $n - I = 22 - 3 = 19$ denominator df. Relevant
quantities include

$$\bar{x}_{\cdot\cdot} = \frac{1}{22}(357.7 + 352.5 + 273.5) = 44.71$$

$$\mathrm{SSTr} = 8(44.71 - 44.71)^2 + 8(44.06 - 44.71)^2 + 6(45.58 - 44.71)^2 = 7.93$$

$$\mathrm{SSE} = (8 - 1)(.624)^2 + (8 - 1)(.501)^2 + (6 - 1)(.549)^2 = 6.00$$

where 357.7, 352.5, and 273.5 are the three sample totals. The remaining computations are displayed in
the accompanying ANOVA table. Since $f = 12.56 \geq F_{.001,2,19} = 10.16$, the P-value is smaller than
.001. Thus the null hypothesis should be rejected at any reasonable significance level; there is
compelling evidence for concluding that true average elastic modulus somehow depends on which casting
process is used.

    Source of variation    df    Sum of squares    Mean square    f
    Treatments              2         7.93            3.965       12.56
    Error                  19         6.00             .3158
    Total                  21        13.93
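The unequal-sample-size formulas can be applied directly to the raw observations of Example 11.10. The following Python sketch is a minimal illustration (the function name `one_way_anova` and the dictionary keys are hypothetical); it works from the raw data, so its results can differ from the hand calculation in the last digit because the text rounds the sample means and standard deviations first.

```python
def one_way_anova(groups):
    """Single-factor ANOVA quantities for samples of possibly unequal size."""
    num_groups = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Treatment sum of squares weights each squared deviation by its J_i.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error sum of squares pools within-sample squared deviations.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mstr = sstr / (num_groups - 1)
    mse = sse / (n - num_groups)
    return {"df_tr": num_groups - 1, "df_e": n - num_groups,
            "SSTr": sstr, "SSE": sse, "MSTr": mstr, "MSE": mse,
            "f": mstr / mse}


elastic_modulus = [
    [45.5, 45.3, 45.4, 44.4, 44.6, 43.9, 44.6, 44.0],  # permanent molding
    [44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1],  # die casting
    [46.0, 45.9, 44.8, 46.2, 45.1, 45.5],              # plaster molding
]

table = one_way_anova(elastic_modulus)
for key, value in table.items():
    print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")
```

The P-value is not computed here; it would require the F cdf, which this sketch deliberately leaves to statistical software.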


Multiple Comparisons When Sample Sizes Are Unequal 

There is more controversy among statisticians regarding which multiple comparisons procedure to
use when sample sizes are unequal than there is in the case of equal sample sizes. The procedure that
we present here is called the Tukey–Kramer procedure for use when the $I$ sample sizes
$J_1, J_2, \ldots, J_I$ are reasonably close to each other (“mild imbalance”). It modifies Tukey’s
method [Equation (11.2)] by using averages of pairs of $1/J_i$’s in place of $1/J$. Let

$$d_{ij} = Q_{\alpha,I,n-I}\sqrt{\frac{\mathrm{MSE}}{2}\left(\frac{1}{J_i} + \frac{1}{J_j}\right)}$$

Then the probability is approximately $1 - \alpha$ that

$$(\bar{X}_{i\cdot} - \bar{X}_{j\cdot}) - d_{ij} \leq \mu_i - \mu_j \leq (\bar{X}_{i\cdot} - \bar{X}_{j\cdot}) + d_{ij}$$

for every $i$ and $j$ ($i = 1, \ldots, I$ and $j = 1, \ldots, I$) with $i \neq j$.

The simultaneous confidence level $100(1 - \alpha)\%$ is now only approximate. The underscoring
method can still be used, but now the honestly significant difference $d_{ij}$ used to decide whether
$\bar{x}_{i\cdot}$ and $\bar{x}_{j\cdot}$ can be connected by a line segment will depend on $J_i$ and
$J_j$.


Example 11.11 (Example 11.10 continued) The sample sizes for the elastic modulus data were
$J_1 = 8$, $J_2 = 8$, $J_3 = 6$, and $I = 3$, $n - I = 19$, MSE = .316. A simultaneous confidence level
of approximately 95% requires $Q_{.05,3,19} = 3.59$, from which

$$d_{12} = 3.59\sqrt{\frac{.316}{2}\left(\frac{1}{8} + \frac{1}{8}\right)} = .713 \qquad d_{13} = .771 \qquad d_{23} = .771$$

Since $\bar{x}_{1\cdot} - \bar{x}_{2\cdot} = 44.71 - 44.06 = .65 < d_{12}$, $\mu_1$ and $\mu_2$ are
judged not significantly different. The accompanying underscoring scheme shows that $\mu_1$ and
$\mu_3$ differ significantly, as do $\mu_2$ and $\mu_3$.

    2. Die     1. Permanent     3. Plaster
    44.06      44.71            45.58
    -----------------------
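The pairwise comparisons of Example 11.11 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the Studentized range critical value 3.59 and MSE = .316 are taken from the example rather than computed, and the names `tukey_kramer_d` and `results` are hypothetical.

```python
from itertools import combinations
from math import sqrt

means = {1: 44.71, 2: 44.06, 3: 45.58}  # permanent, die, plaster
sizes = {1: 8, 2: 8, 3: 6}
mse = 0.316
q_crit = 3.59  # Q_{.05,3,19}, taken from the example, not computed here


def tukey_kramer_d(mse, q, n_i, n_j):
    """Honestly significant difference for samples of sizes n_i and n_j."""
    return q * sqrt(mse / 2 * (1 / n_i + 1 / n_j))


results = {}
for i, j in combinations(sorted(means), 2):
    d = tukey_kramer_d(mse, q_crit, sizes[i], sizes[j])
    diff = abs(means[i] - means[j])
    results[(i, j)] = (d, diff > d)
    verdict = "differ" if diff > d else "do not differ"
    print(f"mu_{i} vs mu_{j}: |diff| = {diff:.2f}, d = {d:.3f} -> {verdict}")
```

Each pair gets its own threshold because $d_{ij}$ depends on the two sample sizes involved, which is the only change from the balanced Tukey procedure.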


Data Transformation 

The use of ANOVA methods can be invalidated by substantial differences in the variances
$\sigma_1^2, \ldots, \sigma_I^2$, which until now have been assumed equal with common value $\sigma^2$.
It sometimes happens that $V(X_{ij}) = \sigma_i^2 = g(\mu_i)$, a known function of $\mu_i$ (so that when
$H_0$ is false, the variances are not equal). For example, if $X_{ij}$ has a Poisson distribution with
parameter $\mu_i$ (approximately normal if $\mu_i > 10$), then $\sigma_i^2 = \mu_i$, so
$g(\mu_i) = \mu_i$ is the known function. In such cases, one can often transform the $X_{ij}$’s to
$h(X_{ij})$’s so that they will have approximately equal variances (while hopefully leaving the
transformed variables approximately normal), and then the F test can be used on the transformed
observations.

The basic idea is that, if $h(\cdot)$ is a smooth function, then we can express it approximately using
the first terms of a Taylor series: $h(X_{ij}) \approx h(\mu_i) + h'(\mu_i)(X_{ij} - \mu_i)$. Then
$V[h(X_{ij})] \approx V(X_{ij}) \cdot [h'(\mu_i)]^2 = g(\mu_i) \cdot [h'(\mu_i)]^2$. We now wish to find
the function $h(\cdot)$ for which $g(\mu_i) \cdot [h'(\mu_i)]^2 = c$ (a constant) for every $i$. Solving
this for $h'(\mu_i)$ and integrating give the following result.

PROPOSITION If $V(X_{ij}) = g(\mu_i)$, a known function of $\mu_i$, then a transformation $h(X_{ij})$
that “stabilizes the variance” so that $V[h(X_{ij})]$ is approximately the same for each $i$ is given
by $h(x) \propto \int [g(x)]^{-1/2}\,dx$.

In the Poisson case, $g(x) = x$, so $h(x)$ should be proportional to $\int x^{-1/2}\,dx = 2x^{1/2}$.
Thus Poisson data should be transformed to $h(x_{ij}) = \sqrt{x_{ij}}$ before the analysis.
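A small simulation illustrates what the square root transformation accomplishes for Poisson data. The sketch below is illustrative only: the means 10, 40, and 90, the sample size of 5000, the seed, and the helper `poisson_draw` are all assumptions chosen for the demonstration. With $h(x) = \sqrt{x}$ the Taylor argument gives $V[h(X)] \approx \mu \cdot (1/(2\sqrt{\mu}))^2 = 1/4$ regardless of $\mu$.

```python
import random
from math import exp, sqrt
from statistics import variance


def poisson_draw(mu, rng):
    """One Poisson(mu) variate by Knuth's multiplication method."""
    limit = exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


rng = random.Random(11)  # fixed seed so the run is reproducible
raw_var, root_var = {}, {}
for mu in (10, 40, 90):
    xs = [poisson_draw(mu, rng) for _ in range(5000)]
    raw_var[mu] = variance(xs)                         # grows with mu
    root_var[mu] = variance([sqrt(x) for x in xs])     # roughly constant
    print(f"mu = {mu}: var(X) = {raw_var[mu]:.2f}, "
          f"var(sqrt(X)) = {root_var[mu]:.3f}")
```

The raw variances track the means, while the variances of the transformed values stay close to 1/4, which is the equal-variance condition the F test requires.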


A Random Effects Model 

The single-factor problems considered so far have all been assumed to be examples of a fixed effects 
ANOVA model. By this we mean that the chosen levels of the factor under study are the only ones 
considered relevant by the experimenter. The single-factor fixed effects model is 


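$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad \text{with} \quad \sum \alpha_i = 0 \qquad (11.4)$$

where the $\varepsilon_{ij}$’s are random and both $\mu$ and the $\alpha_i$’s are fixed parameters
whose values are unknown.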

In some single-factor problems, the particular levels studied by the experimenter are chosen, either 
by design or through sampling, from a large population of levels. For example, to study the effect of 
using different operators on task performance time for a particular machine, a sample of five operators 
might be chosen from a large pool of operators. Similarly, the effect of soil pH on the yield of soybean 
plants might be studied by using soils with four specific pH values chosen from among the many 
possible pH levels. When the levels used are selected at random from a larger population of possible 
levels, the factor is said to be random rather than fixed, and the fixed effects model (11.4) is no longer 
appropriate. An analogous random effects model is obtained by replacing the fixed $\alpha_i$’s in
(11.4) by random variables. The resulting model description is

$$X_{ij} = \mu + A_i + \varepsilon_{ij} \quad \text{with} \quad E(A_i) = E(\varepsilon_{ij}) = 0, \quad V(\varepsilon_{ij}) = \sigma^2, \quad V(A_i) = \sigma_A^2 \qquad (11.5)$$

with all $A_i$’s and $\varepsilon_{ij}$’s normally distributed and independent of each other.

The condition $E(A_i) = 0$ in (11.5) is similar to the condition $\sum \alpha_i = 0$ in (11.4); it
states that the expected or average effect of the $i$th level measured as a departure from $\mu$ is
zero.

For the random effects model (11.5), the hypothesis of no effects due to different levels is
$H_0$: $\sigma_A^2 = 0$, which says that different levels of the factor contribute nothing to
variability of the response. Critically, although the hypotheses in the single-factor fixed and random
effects models are different, they are tested in exactly the same way: by forming F = MSTr/MSE and
rejecting $H_0$ if $f \geq F_{\alpha,I-1,n-I}$. This can be justified intuitively by noting in the
random effects model that $E(\mathrm{MSE}) = \sigma^2$ (as for fixed effects), whereas

$$E(\mathrm{MSTr}) = \sigma^2 + \frac{1}{I-1}\left(n - \frac{\sum J_i^2}{n}\right)\sigma_A^2 \qquad (11.6)$$

where again $J_1, J_2, \ldots, J_I$ are the sample sizes and $n = \sum J_i$. The factor in parentheses
on the right side of (11.6) is nonnegative, so once again $E(\mathrm{MSTr}) = \sigma^2$ if $H_0$ is true
(i.e., if $\sigma_A^2 = 0$) and $E(\mathrm{MSTr}) > \sigma^2$ if $H_0$ is false.


Example 11.12 When items are machined out of metal (or plastic or wood) sheets by drills,
undesirable burrs form along the edge. The article “Observation of Drilling Burr and Finding out the
Condition for Minimum Burr Formation” (Int. J. Manuf. Engr. 2014) reports on a study of the effect
that cutting speed has on burr size. Eighteen measurements were made at each of three speeds (20, 25,
and 31 m/min) that were randomly selected from the range of possible speeds for the particular
equipment used in the experiment. Each measurement is the burr height (mm) from drilling into a
low-alloy steel specimen. The data is summarized in the accompanying table along with the derived
ANOVA table. The very small f statistic and correspondingly large P-value indicate that
$H_0$: $\sigma_A^2 = 0$ should not be rejected. The data does not indicate that cutting speed impacts
burr size.

    Speed (m/min)    $\bar{x}_{i\cdot}$    $s_i$
    20               1.558     2.018
    25               1.998     2.415
    31               1.867     2.148

    Source of variation    df    SS         MS        f      P-value
    Cutting speed           2      1.837    0.9186    0.19   .828
    Error                  51    246.795    4.8391
    Total                  53    248.632
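When only sample sizes, means, and standard deviations are reported, as in Example 11.12 and in many of the exercises that follow, the ANOVA table can be rebuilt from those summaries. The Python sketch below is a minimal illustration with a hypothetical function name; because the inputs are rounded summary statistics, its output can differ slightly in the last digits from a table computed on the raw data.

```python
def anova_from_summary(sizes, means, sds):
    """ANOVA quantities from per-group sample sizes, means, and SDs."""
    n = sum(sizes)
    num_groups = len(sizes)
    # Grand mean is the sample-size-weighted average of the group means.
    grand = sum(j * m for j, m in zip(sizes, means)) / n
    sstr = sum(j * (m - grand) ** 2 for j, m in zip(sizes, means))
    sse = sum((j - 1) * s ** 2 for j, s in zip(sizes, sds))
    mstr = sstr / (num_groups - 1)
    mse = sse / (n - num_groups)
    return {"df_tr": num_groups - 1, "df_e": n - num_groups,
            "SSTr": sstr, "SSE": sse, "MSTr": mstr, "MSE": mse,
            "f": mstr / mse}


# Burr height summaries at the three cutting speeds of Example 11.12.
burr = anova_from_summary(
    sizes=[18, 18, 18],
    means=[1.558, 1.998, 1.867],
    sds=[2.018, 2.415, 2.148],
)
print(f"SSTr = {burr['SSTr']:.3f}, SSE = {burr['SSE']:.3f}, "
      f"MSE = {burr['MSE']:.4f}, f = {burr['f']:.2f}")
```

The same function handles unequal sample sizes, since both the grand mean and SSTr weight each group by its own $J_i$.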
Exercises: Section 11.3 (27-44)

27. The following data refers to yield of tomatoes (kg/plot) for four different levels of salinity;
salinity level here refers to electrical conductivity (EC), where the chosen levels were EC = 1.6,
3.8, 6.0, and 10.2 nmhos/cm:

    EC      Yield
    1.6     59.5   53.3   56.8   63.1   58.7
    3.8     55.2   59.1   52.8   54.5
    6.0     51.7   48.8   53.9   49.0
    10.2    44.6   48.5   41.0   47.3   46.1

Use the F test at level $\alpha = .05$ to test for any differences in true average yield due to the
different salinity levels.

28. Apply the modified Tukey’s method to the data in the previous exercise to identify significant
differences among the $\mu_i$’s.

29. A study at Bentley College, a large business school in the eastern United States, examined
students’ anxiety levels toward the subject of accounting (“Determinants of Accounting Anxiety in
Business Students,” J. College Teach. Learn. 2004). A representative sample of 1020 students
completed the Accounting Anxiety Rating Scale (AARS) questionnaire; higher scores (out of 100)
indicate greater anxiety. Summary data broken down by grade level appears in the accompanying table.

    Class level    n      $\bar{x}$    s
    Freshman        86    48.95     9.13
    Sophomore      224    51.45    11.29
    Junior         225    52.89    11.32
    Senior         198    52.92    11.32
    Graduate       287    45.55    10.10

a. Comment on the plausibility (or necessity) of the normality and equal variance assumptions for
this example.

b. Test at $\alpha = .05$ the hypothesis that the mean accounting anxiety level for business students
varies by class level.

c. Apply the Tukey–Kramer method to identify significant differences among the $\mu_i$’s.

30. The article “From Dark to Light: Skin Color and Wages among African Americans” (J. Human Res.
2007: 701-738) includes the following information on hourly wages ($) for representative samples of
the indicated populations.

    Skin color       n             $\bar{x}$    s
    White            513           15.94    7.73
    Light Black      [illegible]   14.42    6.05
    Medium Black     177           13.23    6.64
    Dark Black       207           11.72    5.60

a. Does population mean hourly wage appear to depend on skin color? Carry out an appropriate test of
hypotheses.

b. Identify significant differences among the $\mu_i$’s at the .05 significance level.

31. The authors of the article “Exploring the Impact of Social Media Practices on Wine Sales in US
Wineries” (J. Direct Data, Digital Market. Pract. 2016: 272-283) interviewed 361 winery managers.
Each manager was asked to report, as a percentage of sales, the impact of social media use on their
wine sales. Each winery’s social media presence was then categorized by the number of social media
platforms it used: 0-2, 3-5, or 6 or more. Summary information appears in the accompanying table.
Test to see whether an association exists between social media presence and sales at the .01
significance level.

    No. of platforms    n      Mean     SD
    Two or fewer        107    12.76    13.00
    Three to five       164    17.23    14.07
    Six or more          90    21.56    17.35

32. The article “Can You Test Me Now? Equivalence of GMA Tests on Mobile and Non-Mobile Devices”
(Int. J. Select. Assess. 2017: 61-71) describes a study in which 1769 people were recruited through
Amazon Mechanical Turk to take a general mental ability test. (Researchers used the WPT-R test, a
variation on the Wonderlic Personnel Test used by the NFL to evaluate players.) Participants were
randomly assigned to take the test on one of three electronic devices; the data is summarized below.

    Device        n      Mean score    SD
    Computer      724    33.98         7.65
    Tablet        476    34.40         7.45
    Smartphone    569    34.26         7.59

The goal of the study was to determine whether the three devices can be considered “equivalent” for
the purpose of administering mental ability tests. Perform an ANOVA at the .05 significance level, and
explain what you discover.

33. Lipids provide much of the dietary energy in the bodies of infants and young children. There is a
growing interest in the quality of the dietary lipid supply during infancy as a major determinant of
growth, visual and neural development, and long-term health. The article “Essential Fat Requirements
of Preterm Infants” (Amer. J. Clin. Nutrit. 2000: 245S-250S) reported the following data on total
polyunsaturated fats (%) for infants who were randomized to four different feeding regimens: breast
milk, corn-oil-based formula, soy-oil-based formula, or soy-and-marine-oil-based formula:

    Regimen        Sample size    Sample mean    Sample SD
    Breast milk     8             43.0           1.5
    CO             13             42.4           1.3
    SO             17             43.1           1.2
    SMO            14             43.5           1.2

a. What assumptions must be made about the four total polyunsaturated fat distributions before
carrying out a single-factor ANOVA to decide whether there are any differences in true average fat
content?

b. Carry out the test suggested in part (a). What can be said about the P-value?

34. Samples of six different brands of diet/imitation margarine were analyzed to determine the level
of physiologically active polyunsaturated fatty acids (PAPFUA, in percentages). The data below is
consistent with a study carried out by Consumer Reports:

    Brand            PAPFUA
    Imperial         14.1   13.6   14.4   14.3
    Parkay           12.8   12.5   13.4   13.0   12.3
    Blue Bonnet      13.5   13.4   14.1   14.3
    Chiffon          13.2   12.7   12.6   13.9
    Mazola           16.8   17.2   16.4   17.3   18.0
    Fleischmann’s    18.1   17.2   18.7   18.4

a. Use ANOVA to test for differences among the true average PAPFUA percentages for the different
brands.

b. Compute CIs for all $(\mu_i - \mu_j)$’s.

c. Mazola and Fleischmann’s are corn-based, whereas the others are soybean-based. Compute a CI for

$$\frac{\mu_1 + \mu_2 + \mu_3 + \mu_4}{4} - \frac{\mu_5 + \mu_6}{2}$$

[Hint: Modify the expression for $V(\hat{\theta})$ that led to (11.3) in the previous section.]

35. Subacromial impingement syndrome (SIS) refers to shoulder pain resulting from a particular
impingement of the rotator cuff tendon. The article “Evaluation of the Effectiveness of Three
Physiotherapeutic Treatments for SIS” (Physiotherapy 2016: 57-63) reports a study in which 99 SIS
sufferers were randomly assigned to receive one of three treatments across 20 sessions. The
Constant-Murley score (CMS), a standard measure of shoulder functionality and pain, was administered
to each subject before the experiment and again one month post-treatment. The accompanying table
summarizes the change in CMS (higher numbers are better) for the subjects.

    Treatment         n     Mean           SD
    Ultrasound        32    [illegible]    10.3
    Phonophoresis     33    6.4             9.3
    Iontophoresis     34    6.5            18.1

Test to see whether the true mean increase in CMS differs across the three different treatments, at
the $\alpha = .05$ significance level.

36. In single-factor ANOVA with sample sizes $J_i$ ($i = 1, \ldots, I$), show that
$\mathrm{SSTr} = \sum_{i=1}^{I} J_i(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = \sum_{i=1}^{I} J_i\bar{X}_{i\cdot}^2 - n\bar{X}_{\cdot\cdot}^2$,
where $n = \sum_{i=1}^{I} J_i$.

37. When sample sizes are equal ($J_i = J$), the parameters $\alpha_1, \alpha_2, \ldots, \alpha_I$ of
the alternative parameterization to the model equation are restricted by $\sum \alpha_i = 0$. For
unequal sample sizes, the most natural restriction is $\sum J_i\alpha_i = 0$. Use this to show that

$$E(\mathrm{MSTr}) = \sigma^2 + \frac{1}{I-1}\sum J_i\alpha_i^2$$

What is E(MSTr) when $H_0$ is true? [This expectation is correct if $\sum J_i\alpha_i = 0$ is replaced
by the restriction $\sum \alpha_i = 0$, or any other single linear restriction on the $\alpha_i$’s used
to reduce the model to $I$ independent parameters, but $\sum J_i\alpha_i = 0$ simplifies the algebra
and yields natural estimates for the model parameters.]

38. Reconsider Example 11.9 involving an investigation of the effects of different heat treatments on
the yield point of steel ingots.

a. If $J = 8$ and $\sigma = 1$, what is $\beta$ for a level .05 F test when $\mu_1 = \mu_2$,
$\mu_3 = \mu_1 - 1$, and $\mu_4 = \mu_1 + 1$?

b. For the alternative $\mu_i$’s of part (a), what value of $J$ is necessary to obtain $\beta = .05$?

c. If there are $I = 5$ heat treatments, $J = 10$, and $\sigma = 1$, what is $\beta$ for the level .05
F test when four of the $\mu_i$’s are equal and the fifth differs by 1 from the other four?

39. For unequal sample sizes, the noncentrality parameter for F test power calculations is
$\lambda = \sum J_i\alpha_i^2/\sigma^2$. Referring to Exercise 27, what is the power of the test when
$\mu_2 = \mu_3$, $\mu_1 = \mu_2 - \sigma$, and $\mu_4 = \mu_2 + \sigma$?

40. The following data on number of cycles to failure (x 10°) appears in the article “An Experimental
Study of Influence of Lubrication Methods on Efficiency and Contact Fatigue Life of Spur Gears”
(J. Tribol. 2018).

    Lubrication condition    Cycles to failure (x 10°)
    D1                       18.79   10.44   14.62   12.53    8.35   14.62
    J1                       16.70   18.79   12.53   26.10   10.44   18.27
    J2                       10.44   14.62   16.70   29.23   22.97   18.50
    J4                        6.26    6.26    6.26    6.26    4.70    5.22

a. Calculate the standard deviation of each sample. Why should we be reluctant to proceed with an
analysis of variance?

b. Take the logarithm of the observations. Do these transformed values adhere better to the
conditions for ANOVA? Explain.

c. Perform one-way ANOVA on these transformed values.

41. Many studies have been conducted to measure mercury (Hg) levels in fish, but little information
exists on Hg concentrations in marine mammals. The article “Factors Influencing Exposure of North
American River Otters...to Mercury Relative to a Large-Scale Reservoir in Northern British Columbia,
Canada” (Ecotoxicology 2019: 343-353) describes a study in which river otters living in five Canadian
reservoirs were tested for Hg concentration (mg/kg). The accompanying summary information was
extracted from a graph in the article.

                       Hg concentration    ln(Concentration)
    Reservoir    n     Mean     SD         Mean     SD
    PW           15    4.1      3.0        1.30     .41
    [remaining four rows illegible]

Hg concentration distributions at all five reservoirs were heavily right-skewed.

a. Why should we hesitate to perform one-way ANOVA on the Hg concentration data?

b. Consider the transformation y = ln(concentration). Estimated summary information appears above,
and taking the logarithm greatly reduces skewness. Apply the ANOVA F test to the transformed data, and
report your findings at the .05 significance level.

c. Apply Tukey’s method to the log-transformed data, if appropriate.

42. Simplify E(MSTr) for the random effects model when $J_1 = J_2 = \cdots = J_I = J$.

43. Suppose that $X_{ij}$ is a binomial variable with parameters $n$ and $p_i$ (so it is approximately
normal when $np_i \geq 10$ and $nq_i \geq 10$). Then since $\mu_i = np_i$,
$V(X_{ij}) = \sigma_i^2 = np_i(1 - p_i) = \mu_i(1 - \mu_i/n)$. How should the $X_{ij}$’s be transformed
so as to stabilize the variance? [Hint: $g(\mu_i) = \mu_i(1 - \mu_i/n)$.]

44. In an experiment to compare the quality of four different brands of magnetic tape (A-D), five
5000-foot reels of each brand were selected and the number of cosmetic flaws on each reel was
determined.

    Brand    No. of flaws
    A        10    5   12   14    8
    B        14   12   17    9    8
    C        13   18   10   15   18
    D        [illegible]

It is believed that the number of flaws has approximately a Poisson distribution for each brand.
Analyze the data at level .01 to see whether the expected number of flaws per reel is the same for
each brand.


11.4 Two-Factor ANOVA without Replication 


In many situations there are two factors of simultaneous interest. For example, a baker might
experiment with $I = 3$ temperatures (325, 350, 375 °F) and $J = 2$ baking times (45 min, 60 min) to
optimize a new cake recipe. Or, an industrial engineer might wish to study the surface roughness
resulting from a certain machining process; she might carry out an experiment at various combinations
of cutting speed and feed rate.

Call the two factors of interest A and B. When factor A has $I$ levels and factor B has $J$ levels,
there are $IJ$ different combinations (pairs) of levels of the two factors, each called a treatment.
If the data includes multiple observations for each treatment, the study is said to include
replication. With $K_{ij}$ = the number of observations on the treatment (factor A, factor B) =
$(i, j)$, we focus in this section on the case $K_{ij} = 1$ (i.e., no replication), so that the data
consists of $IJ$ observations. We will first discuss the fixed effects model, in which the only levels
of interest for the two factors are those actually represented in the study. The case in which at
least one factor is random is considered briefly at the end of the section.
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Example 11.13 Is it really as easy to remove marks on fabrics from erasable pens as the word 
“erasable” might imply? Consider the following data from an experiment to compare three different 
brands of pens and four different wash treatments with respect to their ability to remove marks on a 
particular type of fabric (based on “An Assessment of the Effects of Treatment, Time, and Heat on the 
Removal of Erasable Pen Marks from Cotton and Cotton/Polyester Blend Fabrics,” J. Test. Eval. 
1991: 394-397). The response variable is a quantitative indicator of overall specimen color change; 
the lower this value, the more marks were removed. 


                            Washing treatment
                      1        2        3        4      Average
                1    .97      .48      .48      .46      .598
Brand of pen    2    .77      .14      .22      .25      .345
                3    .67      .39      .57      .19      .455
          Average    .803     .337     .423     .300


Is there any difference in the true average amount of color change due either to the different brands
of pen or to the different washing treatments?


As in single-factor ANOVA, double subscripts are used to identify random variables and observed 
values. Let 


$X_{ij}$ = the random variable denoting the measurement when (factor A, factor B) = (i, j)
$x_{ij}$ = the observed value of $X_{ij}$


The $x_{ij}$’s are usually presented in a two-way table in which the ith row contains the observed values
when factor A is held at level i and the jth column contains the observed values when factor B is held
at level j. In the erasable-pen experiment of Example 11.13, the number of levels of factor A (pen
brand) is I = 3, the number of levels of factor B (washing treatment) is J = 4; $x_{13} = .48$,
$x_{22} = .14$, and so on.

Whereas in single-factor ANOVA we were interested only in row means and the grand mean, here 
we are interested also in column means. Let 


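$$\bar{X}_{i\cdot} = \text{the average of data obtained when factor A is held at level } i = \frac{1}{J}\sum_{j=1}^{J} X_{ij}$$

$$\bar{X}_{\cdot j} = \text{the average of data obtained when factor B is held at level } j = \frac{1}{I}\sum_{i=1}^{I} X_{ij}$$

$$\bar{X}_{\cdot\cdot} = \text{the grand mean} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}$$


with observed values $\bar{x}_{i\cdot}$, $\bar{x}_{\cdot j}$, and $\bar{x}_{\cdot\cdot}$. Intuitively, to see whether there is any
effect due to the levels of factor A, we should compare the observed $\bar{x}_{i\cdot}$’s with each other, and
information about the different levels of factor B should come from the $\bar{x}_{\cdot j}$’s.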
A Two-Factor Fixed Effects Model
Proceeding by analogy to single-factor ANOVA, one’s first inclination in specifying a model is to let
$\mu_{ij}$ = the true average response when (factor A, factor B) = (i, j), giving IJ mean parameters. Then let


$$X_{ij} = \mu_{ij} + \varepsilon_{ij}$$


where $\varepsilon_{ij}$ is the random amount by which the observed value differs from its expectation, and the
$\varepsilon_{ij}$’s are assumed normal and independent with common variance $\sigma^2$. Unfortunately, there is no valid
test procedure for this choice of parameters. The reason is that under the alternative hypothesis of
interest, the $\mu_{ij}$’s are free to take on any values whatsoever and $\sigma^2$ can be any value greater than zero,
so that there are IJ + 1 freely varying parameters. But there are only IJ observations, so after using
each $x_{ij}$ as an estimate of $\mu_{ij}$, there is no way to estimate $\sigma^2$.

To rectify this problem of a model having more parameters than observed values, we must specify
a model that is realistic yet involves relatively few parameters. For the no-replication ($K_{ij} = 1$)
scenario, we assume the existence of a parameter $\mu$, I parameters $\alpha_1, \alpha_2, \ldots, \alpha_I$, and J parameters
$\beta_1, \beta_2, \ldots, \beta_J$ such that


$$X_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad (i = 1, \ldots, I;\ j = 1, \ldots, J) \tag{11.7}$$

Taking expectations on both sides of (11.7) yields

$$\mu_{ij} = \mu + \alpha_i + \beta_j \tag{11.8}$$


The model specified in (11.7) and (11.8) is called an additive model, because each mean response $\mu_{ij}$
is the sum of a true grand mean ($\mu$), an effect due to factor A at level i ($\alpha_i$), and an effect due to factor
B at level j ($\beta_j$). The difference between mean responses for factor A at levels i and i′ when B is held at
level j is $\mu_{ij} - \mu_{i'j}$. Critically, when the model is additive,


$$\mu_{ij} - \mu_{i'j} = (\mu + \alpha_i + \beta_j) - (\mu + \alpha_{i'} + \beta_j) = \alpha_i - \alpha_{i'}$$


which is independent of the level j of the factor B. A similar result holds for $\mu_{ij} - \mu_{ij'}$. Thus additivity
means that the difference in mean responses for two levels of one of the factors is the same for all
levels of the other factor. Figure 11.4a shows a set of mean responses that satisfy the condition of
additivity (which implies parallel lines), and Figure 11.4b shows a nonadditive configuration of mean
responses.


Figure 11.4 Mean responses for two types of model: (a) additive; (b) nonadditive
[Each panel plots mean response against the levels of A, with one line for each level of B.]

If additivity does not hold, we say that interaction is present. Factors A and B have an interaction
effect on the response variable if the effect of factor A on the (mean) response value depends upon the
level of factor B, and vice versa. The foregoing discussion implies that an additive model assumes
there is no interaction effect. The graphs in Figure 11.4 are called interaction plots; Figure 11.4a
displays data consistent with no interaction effect, whereas Figure 11.4b indicates a potentially strong
interaction.

When $K_{ij} = 1$, there is insufficient data to estimate any potential interaction effects, and so the
additive model specified by (11.7) and (11.8) must be used. In Section 11.5, where $K_{ij} > 1$, we will
consider models that include interaction effects.


Example 11.14 (Example 11.13 continued) When the observed $x_{ij}$’s are plotted in a manner
analogous to that of Figure 11.4, we get the result shown in Figure 11.5. Although there is some
“crossing over” in the observed $x_{ij}$’s, the configuration is reasonably representative of what would be
expected under additivity with just one observation per treatment.


Figure 11.5 Plot of data from Example 11.13
[Color change is plotted against washing treatment (1 to 4), with one line for each brand of pen.]


Expression (11.8) is still not quite our final model description, because the $\alpha_i$’s and $\beta_j$’s are not
uniquely determined. Following are two different configurations of the $\alpha_i$’s and $\beta_j$’s (with $\mu = 0$ for
convenience) that yield the same additive $\mu_{ij}$’s.

                  $\beta_1 = 1$     $\beta_2 = 4$                          $\beta_1 = 2$     $\beta_2 = 5$
$\alpha_1 = 1$    $\mu_{11} = 2$    $\mu_{12} = 5$       $\alpha_1 = 0$    $\mu_{11} = 2$    $\mu_{12} = 5$
$\alpha_2 = 2$    $\mu_{21} = 3$    $\mu_{22} = 6$       $\alpha_2 = 1$    $\mu_{21} = 3$    $\mu_{22} = 6$


By subtracting any constant c from all $\alpha_i$’s and adding c to all $\beta_j$’s, other configurations
corresponding to the same additive model are obtained. This nonuniqueness is eliminated by use of the
following model, which imposes an extra constraint on the $\alpha_i$’s and $\beta_j$’s.


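TWO-FACTOR ANOVA ADDITIVE MODEL EQUATION


$$X_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \tag{11.9}$$


where $\sum_{i=1}^{I} \alpha_i = 0$, $\sum_{j=1}^{J} \beta_j = 0$, and the $\varepsilon_{ij}$’s are assumed to be
independent normal rvs with mean 0 and variance $\sigma^2$.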


This is analogous to the alternative choice of parameters for single-factor ANOVA discussed in
Section 11.3. It is not difficult to verify that (11.9) is an additive model in which the parameters are
uniquely determined. Notice that there are now only I - 1 independently determined $\alpha_i$’s and J - 1
independently determined $\beta_j$’s, so including $\mu$, Expression (11.9) specifies (I - 1) + (J - 1) + 1 =
I + J - 1 parameters.
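
The nonuniqueness of (11.8) and its removal by the sum-to-zero constraints of (11.9) can be checked numerically. The sketch below uses the two configurations listed above; the helper names `additive_means` and `center` are hypothetical, introduced only for this illustration.

```python
def additive_means(mu, alphas, betas):
    """Table of mean responses mu_ij = mu + alpha_i + beta_j."""
    return [[mu + a + b for b in betas] for a in alphas]


def center(mu, alphas, betas):
    """Re-express the parameters so the alphas and the betas each sum to zero.

    The averages of the alphas and betas are absorbed into mu, which leaves
    every mu_ij unchanged.
    """
    a_bar = sum(alphas) / len(alphas)
    b_bar = sum(betas) / len(betas)
    return (mu + a_bar + b_bar,
            [a - a_bar for a in alphas],
            [b - b_bar for b in betas])


config_1 = (0, [1, 2], [1, 4])   # mu, alphas, betas
config_2 = (0, [0, 1], [2, 5])

# Both configurations give the same table of mean responses.
print(additive_means(*config_1))  # [[2, 5], [3, 6]]
print(additive_means(*config_2))  # [[2, 5], [3, 6]]

# After imposing the constraints, both reduce to the same parameter values.
print(center(*config_1))  # (4.0, [-0.5, 0.5], [-1.5, 1.5])
print(center(*config_2))  # (4.0, [-0.5, 0.5], [-1.5, 1.5])
```

Two different-looking parameter sets collapse to a single constrained set, which is the sense in which the parameters of (11.9) are uniquely determined.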

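The interpretation of the parameters of (11.9) is straightforward: $\mu$ is the true grand mean response
over all levels of both factors; $\alpha_i$ is the effect of factor A at level i measured as a deviation from $\mu$; and
$\beta_j$ is the effect of factor B at level j. Unbiased (and maximum likelihood) estimators for these
parameters are

$$\hat{\mu} = \bar{X}_{\cdot\cdot} \qquad \hat{\alpha}_i = \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot} \qquad \hat{\beta}_j = \bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot}$$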


Test Procedures 

There are two different hypotheses of interest in a two-factor experiment with $K_{ij} = 1$. The first,
denoted by $H_{0A}$, states that the different levels of factor A have no effect on true average response. The
second, denoted by $H_{0B}$, asserts that there is no factor B effect.


$$H_{0A}: \alpha_1 = \alpha_2 = \cdots = \alpha_I = 0 \quad \text{versus} \quad H_{aA}: \text{at least one } \alpha_i \neq 0$$

$$H_{0B}: \beta_1 = \beta_2 = \cdots = \beta_J = 0 \quad \text{versus} \quad H_{aB}: \text{at least one } \beta_j \neq 0$$


No factor A effect implies that all $\alpha_i$’s are equal, so they must all be 0 since they sum to 0, and
similarly for the $\beta_j$’s. The analysis now follows closely that for single-factor ANOVA. The relevant
sums of squares and associated dfs are given in the accompanying box.

$$SSA = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = J\sum_{i=1}^{I} \hat{\alpha}_i^2 \qquad df = I - 1$$

$$SSB = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2 = I\sum_{j=1}^{J} \hat{\beta}_j^2 \qquad df = J - 1$$

$$SSE = \sum_{i=1}^{I}\sum_{j=1}^{J} (X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2 \qquad df = (I - 1)(J - 1)$$

The fundamental ANOVA identity is 


SST = SSA + SSB + SSE 


SSA and SSB take the place of SSTr from single-factor ANOVA. The unwieldy expression for SSE
results from replacing $\mu$, $\alpha_i$, and $\beta_j$ in $\sum\sum [X_{ij} - (\mu + \alpha_i + \beta_j)]^2$ by their respective estimators.
Error df is IJ - [number of mean parameters estimated] = IJ - (I + J - 1) = (I - 1)(J - 1). Analogous to
single-factor ANOVA, total variation is split into a part (SSE) that cannot be attributed to the truth or
falsity of $H_{0A}$ and $H_{0B}$ (i.e., unexplained variation) and two parts that can be explained by possible
falsity of the two null hypotheses.

Forming F ratios as in single-factor ANOVA, it can be shown as in Section 11.1 that if $H_{0A}$ is true,
the corresponding ratio has an F distribution with numerator df = I - 1 and denominator df =
(I - 1)(J - 1); an analogous result applies when testing $H_{0B}$.


Hypotheses                      Test Statistic Value       Rejection Region
$H_{0A}$ versus $H_{aA}$        $f_A = MSA/MSE$            $f_A \ge F_{\alpha,\, I-1,\, (I-1)(J-1)}$
$H_{0B}$ versus $H_{aB}$        $f_B = MSB/MSE$            $f_B \ge F_{\alpha,\, J-1,\, (I-1)(J-1)}$


The corresponding P-values for the two tests are the areas under the associated F curves to the right of
$f_A$ and $f_B$, respectively.


Example 11.15 (Example 11.13 continued) The $\bar{x}_{i\cdot}$’s (row means) and $\bar{x}_{\cdot j}$’s (column means) for the
color change data are displayed along the right and bottom margins of the data table in Example
11.13. In addition, the grand mean is $\bar{x}_{\cdot\cdot} = .466$. Table 11.7 summarizes further calculations.


Table 11.7 ANOVA table for Example 11.15


Source of variation          df                     Sum of squares   Mean square     f               P-value
Factor A (pen brand)         I - 1 = 2              SSA = .1282      MSA = .0641     f_A = 4.43      .066
Factor B (wash treatment)    J - 1 = 3              SSB = .4797      MSB = .1599     f_B = 11.05     .007
Error                        (I - 1)(J - 1) = 6     SSE = .0868      MSE = .01447
Total                        IJ - 1 = 11            SST = .6947

The critical value for testing $H_{0A}$ at level of significance .05 is $F_{.05,2,6} = 5.14$. Since 4.43 < 5.14,
$H_{0A}$ cannot be rejected at significance level .05. Based on this (small) data set, we cannot conclude
that true average color change depends on brand of pen. Because $F_{.05,3,6} = 4.76$ and 11.05 > 4.76,
$H_{0B}$ is rejected at significance level .05 in favor of the assertion that color change varies with
washing treatment. The same conclusions result from consideration of the P-values: .066 > .05 and
.007 < .05.
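
The entries of Table 11.7 can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library; the function name `two_way_anova` is hypothetical, the data entry 0.57 is inferred from the printed averages, and the critical values 5.14 and 4.76 are simply copied from the text rather than computed.

```python
def two_way_anova(x):
    """ANOVA quantities for a two-way layout with one observation per cell.

    x[i][j] is the observation at level i of factor A and level j of factor B.
    """
    I, J = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (I * J)
    row_means = [sum(row) / J for row in x]
    col_means = [sum(row[j] for row in x) / I for j in range(J)]

    ssa = J * sum((m - grand) ** 2 for m in row_means)
    ssb = I * sum((m - grand) ** 2 for m in col_means)
    sse = sum((x[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(I) for j in range(J))
    sst = sum((v - grand) ** 2 for row in x for v in row)

    msa, msb = ssa / (I - 1), ssb / (J - 1)
    mse = sse / ((I - 1) * (J - 1))
    return {"SSA": ssa, "SSB": ssb, "SSE": sse, "SST": sst,
            "MSA": msa, "MSB": msb, "MSE": mse,
            "fA": msa / mse, "fB": msb / mse}


pens = [
    [0.97, 0.48, 0.48, 0.46],
    [0.77, 0.14, 0.22, 0.25],
    [0.67, 0.39, 0.57, 0.19],  # 0.57 is inferred from the printed averages
]
result = two_way_anova(pens)
for name, value in result.items():
    print(f"{name}: {value:.4f}")

# Critical values quoted in the text for alpha = .05.
F_CRIT_A = 5.14   # F(.05; 2, 6)
F_CRIT_B = 4.76   # F(.05; 3, 6)
reject_A = result["fA"] >= F_CRIT_A
reject_B = result["fB"] >= F_CRIT_B
print("reject H0A:", reject_A, " reject H0B:", reject_B)
```

Small differences in the last digit relative to Table 11.7 are to be expected, because the table was evidently built from rounded intermediate values.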

The plausibility of the normality and constant variance assumptions can be investigated
graphically by first calculating the predicted values (also called fitted values) $\hat{x}_{ij}$ and the residuals
(the differences between the observations and predicted values) $e_{ij}$:


$$\hat{x}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j = \bar{x}_{\cdot\cdot} + (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) + (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}) = \bar{x}_{i\cdot} + \bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}$$

$$e_{ij} = x_{ij} - \hat{x}_{ij} = x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}$$


We can check the normality assumption with a normal plot of the residuals, Figure 11.6a, and
then the constant variance assumption with a plot of the residuals against the fitted values,
Figure 11.6b.


Figure 11.6 Residual plots for Example 11.15: (a) normal probability plot of the residuals;
(b) residuals versus the fitted values


The normal probability plot is reasonably straight, so there is no reason to question normality for
this data set. In the plot of the residuals against the fitted values, look for differences in vertical spread
as we move horizontally across the graph. For example, if there were a narrow range for small fitted
values and a wide range for high fitted values, this would suggest that the variance is higher for
larger responses (this happens often, and it can sometimes be cured by transforming via logarithms).
No such problem occurs here, so there is no evidence against the constant variance assumption,
either.
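
The fitted values and residuals that feed such plots are easy to compute directly. The sketch below does so for the color-change data (again with the inferred entry 0.57); variable names are illustrative, and the plotting itself is left to whatever graphics tool is at hand.

```python
x = [
    [0.97, 0.48, 0.48, 0.46],
    [0.77, 0.14, 0.22, 0.25],
    [0.67, 0.39, 0.57, 0.19],  # 0.57 is inferred from the printed averages
]
I, J = len(x), len(x[0])
grand = sum(sum(row) for row in x) / (I * J)
row_means = [sum(row) / J for row in x]
col_means = [sum(row[j] for row in x) / I for j in range(J)]

# Fitted value: row mean + column mean - grand mean.
fitted = [[row_means[i] + col_means[j] - grand for j in range(J)]
          for i in range(I)]
# Residual: observation minus fitted value.
resid = [[x[i][j] - fitted[i][j] for j in range(J)] for i in range(I)]

for i in range(I):
    print("  ".join(f"{fitted[i][j]:.3f} ({resid[i][j]:+.3f})" for j in range(J)))

# The squared residuals add up to SSE.
sse = sum(e ** 2 for row in resid for e in row)
print(f"SSE = {sse:.4f}")
```

Under the additive fit the residuals sum to zero within every row and every column, which is one way to see why only (I - 1)(J - 1) of them are free to vary.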
Expected Mean Squares

The plausibility of using the F tests just described is demonstrated by determining the expected mean
squares. After some tedious algebra,


$$E(MSE) = \sigma^2 \quad \text{(when the model is additive)}$$

$$E(MSA) = \sigma^2 + \frac{J}{I-1}\sum_{i=1}^{I} \alpha_i^2$$

$$E(MSB) = \sigma^2 + \frac{I}{J-1}\sum_{j=1}^{J} \beta_j^2$$


When $H_{0A}$ is true, MSA is an unbiased estimator of $\sigma^2$, so $F_A$ is a ratio of two unbiased estimators of
$\sigma^2$. When $H_{0A}$ is false, MSA tends to overestimate $\sigma^2$, so $H_{0A}$ should be rejected when the ratio $F_A$ is
too large. Similar comments apply to MSB and $H_{0B}$.


Multiple Comparisons 

When either $H_{0A}$ or $H_{0B}$ has been rejected, Tukey’s procedure can be used to identify significant
differences between the levels of the factor under investigation. The steps in the analysis are identical
to those for a single-factor ANOVA:


1. For comparing levels of factor A, obtain $Q_{\alpha,\, I,\, (I-1)(J-1)}$.
   For comparing levels of factor B, obtain $Q_{\alpha,\, J,\, (I-1)(J-1)}$.
2. Compute Tukey’s honestly significant difference:

$$d = Q \cdot (\text{estimated SD of the sample means being compared}) =
\begin{cases}
Q_{\alpha,\, I,\, (I-1)(J-1)} \cdot \sqrt{MSE/J} & \text{for factor A comparisons} \\
Q_{\alpha,\, J,\, (I-1)(J-1)} \cdot \sqrt{MSE/I} & \text{for factor B comparisons}
\end{cases}$$

(because, e.g., the standard deviation of $\bar{X}_{i\cdot} = (1/J)\sum_{j} X_{ij}$ is $\sigma/\sqrt{J}$).

3. Arrange the sample means in increasing order, then underscore those pairs differing by less than
d. Pairs not underscored by the same line correspond to significantly different levels of the given
factor.


Example 11.16 (Example 11.15 continued) Identification of significant differences among the four
washing treatments requires $Q_{.05,4,6} = 4.90$ and $d = 4.90\sqrt{.01447/3} = .340$. The four factor
B sample means (column averages) are now listed in increasing order, and any pair differing by less
than .340 is underscored by a line segment:


$\bar{x}_{\cdot 4}$     $\bar{x}_{\cdot 2}$     $\bar{x}_{\cdot 3}$     $\bar{x}_{\cdot 1}$
.300      .337      .423      .803
------------------------


Washing treatment 1 is significantly worse than the other three treatments, but no other significant
differences are identified. In particular, it is not apparent which among treatments 2, 3, and 4 is best at
removing marks.

Notice that Tukey’s HSD is not required for comparing the levels of factor A, since the ANOVA
F test for that factor did not reveal any statistically significant effect.
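
The comparison in Example 11.16 can be sketched in Python as follows. The studentized range value 4.90 is taken from the text rather than computed, and the function name `tukey_hsd` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def tukey_hsd(q, mse, n_per_mean):
    """Honestly significant difference d = Q * sqrt(MSE / n_per_mean).

    n_per_mean is the number of observations averaged in each sample mean:
    J for factor A (row) means, I for factor B (column) means.
    """
    return q * sqrt(mse / n_per_mean)


# Column (washing treatment) means from Example 11.13; each averages I = 3 values.
col_means = {1: 0.803, 2: 0.337, 3: 0.423, 4: 0.300}
d = tukey_hsd(q=4.90, mse=0.01447, n_per_mean=3)

# Pairs of treatments whose means differ by at least d are declared different.
different = [(a, b) for a, b in combinations(sorted(col_means), 2)
             if abs(col_means[a] - col_means[b]) >= d]

print(f"d = {d:.3f}")
print("significantly different pairs:", different)
```

Only the pairs involving treatment 1 exceed d, matching the underscoring pattern described in the example.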
Randomized Block Experiments

In using single-factor ANOVA to test for the presence of effects due to the I different treatments under
study, once the IJ subjects or experimental units have been chosen, treatments should be allocated in a
completely random fashion. That is, J subjects should be chosen at random for the first treatment, then
another sample of J chosen at random from the remaining subjects for the second treatment, and so
on.

It frequently happens, though, that subjects or experimental units exhibit differences with respect
to other characteristics that may affect the observed responses. For example, some patients might be
healthier than others. When this is the case, the presence or absence of a significant F value may be
due to these other differences rather than to the presence or absence of factor effects. This was the
reason for introducing paired experiments in Chapter 10. The generalization of the paired experiment
to I > 2 is called a randomized block design. An extraneous factor, “blocks,” is constructed by
dividing the IJ units into J groups (with I units in each group) in such a way that within each block,
the I units are homogeneous with respect to other factors thought to affect the responses. Then within
each homogeneous block, the I treatments are randomly assigned to the I units or subjects in the
block.


Example 11.17 A consumer product-testing organization wished to compare the annual power 
consumption for five different brands of dehumidifier. Because power consumption depends on the 
prevailing humidity level, it was decided to monitor each brand at four different levels ranging from 
moderate to heavy humidity (thus blocking on humidity level). Within each humidity level, brands 
were randomly assigned to the five selected locations. The resulting amount of power consumption 
(annual kWh) appears in Table 11.8, and the ANOVA calculations are summarized in Table 11.9. 


Table 11.8 Power consumption data for Example 11.17

                              Blocks (humidity level)
Treatments (brands)       1        2        3        4       $\bar{x}_{i\cdot}$
        1                685      792      838      875      797.50
        2                722      806      893      953      843.50
        3                733      802      880      941      839.00
        4                811      888      952     1005      914.00
        5                828      920      978     1023      937.25


Table 11.9 ANOVA table for Example 11.17

Source of variation      df     Sum of squares     Mean square     f
Treatments (brands)       4       53,231.00         13,307.75      f_A = 95.57
Blocks                    3      116,217.75         38,739.25      f_B = 278.20
Error                    12        1,671.00            139.25
Total                    19      171,119.75


Since $f_A = 95.57 > F_{.05,4,12} = 3.26$, $H_{0A}$ is rejected in favor of $H_{aA}$. We conclude that power
consumption does depend on the brand of dehumidifier. To identify significantly different brands, we
use Tukey’s procedure; $Q_{.05,5,12} = 4.51$ and $d = 4.51\sqrt{139.25/4} = 26.6$.

$\bar{x}_{1\cdot}$       $\bar{x}_{3\cdot}$       $\bar{x}_{2\cdot}$       $\bar{x}_{4\cdot}$       $\bar{x}_{5\cdot}$
797.50     839.00     843.50     914.00     937.25
           -----------------     -----------------


The underscoring indicates that the brands can be divided into three groups with respect to power
consumption.

Because the blocking factor is of secondary interest, $F_{.05,3,12}$ is not needed, though the computed
value of $f_B$ is clearly highly significant. Figure 11.7 shows SAS output for this data. Notice that in the
first part of the ANOVA table, the sums of squares (SS’s) for treatments (brands) and blocks
(humidity levels) are combined into a single “model” SS.


Analysis of Variance Procedure

Dependent Variable: POWERUSE
                               Sum of           Mean
Source              DF        Squares         Square     F Value     Pr > F
Model                7     169448.750      24206.964      173.84     0.0001
Error               12       1671.000        139.250
Corrected Total     19     171119.750

R-Square          C.V.      Root MSE      POWERUSE Mean
0.990235      1.362242       11.8004          866.25000

Source              DF       Anova SS    Mean Square     F Value     Pr > F
BRAND                4      53231.000      13307.750       95.57     0.0001
HUMIDITY             3     116217.750      38739.250      278.20     0.0001

Alpha = 0.05   df = 12   MSE = 139.25
Critical Value of Studentized Range = 4.508
Minimum Significant Difference = 26.597
Means with the same letter are not significantly different.

Tukey Grouping         Mean      N     BRAND
             A      937.250      4     5
             A
             A      914.000      4     4
             B      843.500      4     2
             B
             B      839.000      4     3
             C      797.500      4     1

Figure 11.7 SAS output for consumption data
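
For readers without access to SAS, the same sums of squares can be recovered from Table 11.8 in plain Python. This is a sketch rather than a replacement for a statistics package: it uses the value 733 for brand 3 at humidity level 1 (the value implied by that row's printed mean of 839.00), copies the studentized range value 4.51 from the text, and uses illustrative variable names throughout.

```python
from math import sqrt

# Table 11.8: rows = brands (treatments), columns = humidity levels (blocks).
power = [
    [685, 792, 838, 875],
    [722, 806, 893, 953],
    [733, 802, 880, 941],
    [811, 888, 952, 1005],
    [828, 920, 978, 1023],
]
I, J = len(power), len(power[0])
grand = sum(sum(row) for row in power) / (I * J)
row_means = [sum(row) / J for row in power]
col_means = [sum(row[j] for row in power) / I for j in range(J)]

ss_brand = J * sum((m - grand) ** 2 for m in row_means)
ss_block = I * sum((m - grand) ** 2 for m in col_means)
ss_total = sum((v - grand) ** 2 for row in power for v in row)
ss_error = ss_total - ss_brand - ss_block
ss_model = ss_brand + ss_block            # the combined "Model" line in the SAS output

mse = ss_error / ((I - 1) * (J - 1))
f_brand = (ss_brand / (I - 1)) / mse
f_block = (ss_block / (J - 1)) / mse

# Tukey's HSD for brand means, each an average of J = 4 observations.
d = 4.51 * sqrt(mse / J)

print(f"SS(brand) = {ss_brand:.2f}, SS(blocks) = {ss_block:.2f}, SSE = {ss_error:.2f}")
print(f"SS(model) = {ss_model:.2f}, SST = {ss_total:.2f}")
print(f"f_A = {f_brand:.2f}, f_B = {f_block:.2f}, d = {d:.1f}")
```

The "Model" sum of squares is just the brand and block sums of squares added together, which is why SAS reports it on a single line with 4 + 3 = 7 degrees of freedom.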


In some experimental situations in which treatments are to be applied to subjects, a single subject
can receive all I of the treatments. Blocking is then often done on the subjects themselves to control
for variability between subjects, with the treatments typically applied to each subject in random
order; each subject is then said to act as its own control. Social scientists sometimes refer to such
experiments as repeated-measures designs. The “units” within a block are then the different
“instances” of treatment application. Similarly, blocks are often taken as different time periods,
locations, or observers.




In most randomized block experiments in which subjects serve as blocks, the subjects actually
participating in the experiment are selected from a large population. The subjects then contribute
random rather than fixed effects. This does not impact the procedure for comparing treatments when
$K_{ij} = 1$ (one observation per “cell,” as in this section), but the procedure is altered if $K_{ij} > 1$. We will
shortly consider two-factor models in which effects are random.


More on Blocking When I = 2, either the F test above or the paired differences t test can be used to
analyze the data. The resulting conclusion will not depend on which procedure is used, since
$T^2 = F$ and $t^2_{\alpha/2,\nu} = F_{\alpha,1,\nu}$.
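
The identity between the squared paired t statistic and the block-design F statistic when I = 2 can be verified numerically. As an illustration only, the sketch below compares brands 1 and 2 from Table 11.8 across the four humidity levels; singling out these two brands is our choice for the demonstration, not an analysis carried out in the text.

```python
from math import sqrt

brand1 = [685, 792, 838, 875]
brand2 = [722, 806, 893, 953]
J = len(brand1)

# Paired t statistic based on the J within-block differences.
diffs = [b - a for a, b in zip(brand1, brand2)]
d_bar = sum(diffs) / J
s2_d = sum((d - d_bar) ** 2 for d in diffs) / (J - 1)
t = d_bar / sqrt(s2_d / J)

# Randomized block F statistic with I = 2 treatments and J blocks.
x = [brand1, brand2]
I = len(x)
grand = sum(sum(row) for row in x) / (I * J)
row_means = [sum(row) / J for row in x]
col_means = [sum(row[j] for row in x) / I for j in range(J)]
ssa = J * sum((m - grand) ** 2 for m in row_means)
sse = sum((x[i][j] - row_means[i] - col_means[j] + grand) ** 2
          for i in range(I) for j in range(J))
f = (ssa / (I - 1)) / (sse / ((I - 1) * (J - 1)))

print(f"t = {t:.4f}, t^2 = {t ** 2:.4f}, f = {f:.4f}")
```

Both statistics are based on J - 1 = 3 error degrees of freedom here, so the two tests necessarily reach the same decision at any significance level.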

Just as with pairing, blocking entails both a potential gain and a potential loss in precision. If there is a
great deal of heterogeneity in experimental units, the value of the variance parameter $\sigma^2$ in the one-way
model will be large. The effect of blocking is to filter out the variation represented by $\sigma^2$ in the two-way
model appropriate for a randomized block experiment. Other things being equal, a smaller value of $\sigma^2$
results in a test that is more likely to detect departures from $H_0$ (i.e., a test with greater power).

However, other things are not equal here, since the single-factor F test is based on I(J - 1) degrees
of freedom (df) for error, whereas the two-factor F test is based on (I - 1)(J - 1) df for error. Fewer
degrees of freedom for error results in a decrease in power, essentially because the denominator
estimator of $\sigma^2$ is not as precise. This loss in degrees of freedom can be especially serious if the
experimenter can afford only a small number of observations. Nevertheless, if it appears that blocking
will significantly reduce variability, it is probably worth the loss in degrees of freedom.


Models for Random Effects 

In many experiments, the actual levels of a factor used in the experiment, rather than being the only
ones of interest to the experimenter, have been selected from a much larger population of possible
levels of the factor. In a two-factor situation, when this is the case for both factors, a random effects
model is appropriate. The case in which the levels of one factor are the only ones of interest while the
levels of the other factor are selected from a population of levels leads to a mixed effects model. The
two-factor random effects model when $K_{ij} = 1$ is


$$X_{ij} = \mu + A_i + B_j + \varepsilon_{ij} \qquad (i = 1, \ldots, I;\ j = 1, \ldots, J)$$


where the $A_i$’s, $B_j$’s, and $\varepsilon_{ij}$’s are all independent, normally distributed rvs with mean 0 and variances
$\sigma_A^2$, $\sigma_B^2$, and $\sigma^2$, respectively.

The hypotheses of interest are then $H_{0A}: \sigma_A^2 = 0$ (level of factor A does not contribute to variation
in the response) versus $H_{aA}: \sigma_A^2 > 0$ and $H_{0B}: \sigma_B^2 = 0$ versus $H_{aB}: \sigma_B^2 > 0$. Whereas $E(MSE) = \sigma^2$ as
before, the expected mean squares for factors A and B are now

$$E(MSA) = \sigma^2 + J\sigma_A^2 \quad \text{and} \quad E(MSB) = \sigma^2 + I\sigma_B^2$$


Thus when $H_{0A}$ ($H_{0B}$) is true, $F_A$ ($F_B$) is still a ratio of two unbiased estimators of $\sigma^2$. It can be shown
that a test with significance level $\alpha$ for $H_{0A}$ versus $H_{aA}$ still rejects $H_{0A}$ if $f_A \ge F_{\alpha,\, I-1,\, (I-1)(J-1)}$ and,
similarly, the same procedure as before is used to decide between $H_{0B}$ and $H_{aB}$.
For the case in which factor A is fixed and factor B is random, the mixed model is

$$X_{ij} = \mu + \alpha_i + B_j + \varepsilon_{ij} \qquad (i = 1, \ldots, I;\ j = 1, \ldots, J)$$

where $\sum \alpha_i = 0$, and the $B_j$’s and $\varepsilon_{ij}$’s are all independent, normally distributed rvs with mean 0 and
variances $\sigma_B^2$ and $\sigma^2$, respectively. Now the two null hypotheses are


$$H_{0A}: \alpha_1 = \cdots = \alpha_I = 0 \quad \text{and} \quad H_{0B}: \sigma_B^2 = 0$$


Expected mean squares are


$$E(MSA) = \sigma^2 + \frac{J}{I-1}\sum_{i=1}^{I} \alpha_i^2 \quad \text{and} \quad E(MSB) = \sigma^2 + I\sigma_B^2$$


The test procedures for $H_{0A}$ versus $H_{aA}$ and $H_{0B}$ versus $H_{aB}$ are exactly as before. For example, in the
analysis of the color change data in Example 11.13, if the four wash treatments were randomly
selected, then because $f_B = 11.05 \ge F_{.05,3,6} = 4.76$, $H_{0B}: \sigma_B^2 = 0$ is rejected in favor of $H_{aB}: \sigma_B^2 > 0$.

Summarizing, when $K_{ij} = 1$, although the hypotheses and expected mean squares differ from the
case of both effects fixed, the test procedures are identical.


Exercises: Section 11.4 (45-60) 


45. The number of miles of useful tread wear (in 1000’s) was determined for tires of each
of five different makes of subcompact car (factor A, with I = 5) in combination with
each of four different brands of radial tires (factor B, with J = 4), resulting in IJ = 20
observations. The values SSA = 30.6, SSB = 44.1, and SSE = 59.2 were then
computed. Assume that an additive model is appropriate.

a. Test $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0$ (no differences in true average tire
lifetime due to makes of cars) versus $H_a$: at least one $\alpha_i \neq 0$ using a level .05 test.

b. Test $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$ (no differences in true average tire lifetime
due to brands of tires) versus $H_a$: at least one $\beta_j \neq 0$ using a level .05 test.


46. Four different coatings are being considered
for corrosion protection of metal pipe. The pipe will be buried in three different types
of soil. To investigate whether the amount of corrosion depends either on the coating
or on the type of soil, 12 pieces of pipe are selected. Each piece is coated with one of
the four coatings and buried in one of the three types of soil for a fixed time, after
which the amount of corrosion (depth of maximum pits, in .0001 in.) is determined.
The depths are shown in this table:

                        Soil type (B)
                      1      2      3
                1    64     49     50
Coating (A)     2    53     51     48
                3    47     45     50
                4    51     43     52

a. Assuming the validity of the additive model, carry out the ANOVA analysis
using an ANOVA table to see whether the amount of corrosion depends on
either the type of coating used or the type of soil. Use $\alpha = .05$.

b. Compute $\hat{\mu}$, $\hat{\alpha}_1$, $\hat{\alpha}_2$, $\hat{\alpha}_3$, $\hat{\alpha}_4$, $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$.


47. The article “Step-Counting Accuracy of
Activity Monitors in Persons with Down Syndrome” (J. Intellect. Disabil. Res. 2019:
21-30) describes a study in which 17 people with DS walked for a set time period with
multiple step-counting devices attached to them. (Walking is a common form of
exercise for people with DS, and clinicians want to ensure that step counts for them are
accurate.) The accompanying table summarizes the different step-counting methods
and the number of steps recorded for each participant. (LFE refers to a low frequency
extension filter applied to the device.)

Step-counting method           Mean     SD
Hand tally                      668     70
Pedometer                       537    202
Hip accelerometer               466    159
Hip accelerometer + LFE         606     93
Wrist accelerometer             449     89
Wrist accelerometer + LFE       579     85

Sums of squares consistent with this and other information in the article include
SS(Method) = 596,748, SSE = 987,380, and SST = 2,113,228.

a. Determine the sum of squares associated with the blocking variable (subject), and
then construct an ANOVA table.

b. Assuming that model assumptions are plausible, test the null hypothesis of “no
method effect” at the .01 significance level.

c. Apply Tukey’s procedure to these six step-counting methods. Are any of the five
device-based methods not significantly different from hand tally (considered by
the researchers to be the most correct)?


48. In an experiment to see whether the amount
of coverage of light-blue interior latex paint depends either on the brand of paint or on
the brand of roller used, 1 gallon of each of four brands of paint was applied using each
of three brands of roller, resulting in the following data (number of square feet
covered).

                   Roller brand
                  1      2      3
            1    454    446    451
Paint       2    446    444    447
brand       3    439    442    444
            4    444    437    443

a. Construct the ANOVA table. [Hint: The computations can be expedited by
subtracting 400 (or any other convenient number) from each observation. This
does not affect the final results.]

b. Check the normality and constant variance assumptions graphically.

c. State and test hypotheses appropriate for deciding whether paint brand has any
effect on coverage. Use $\alpha = .05$.

d. Repeat part (c) for brand of roller.

e. Use Tukey’s method to identify significant differences among brands. Is there
one brand that seems clearly preferable to the others?


49. The following data is presented in the article “Influence of Cutting Parameters on
Drill Bit Temperature” (Ind. Lubr. Tribol. 2007: 186-193); values in the table are
temperatures in °C.

                      Feed rate
                    1      2      3
                1  275    325    365
Spindle speed   2  380    415    420
                3  425    420    405

a. Construct an ANOVA table from this data.

b. Test whether spindle speed impacts drill bit temperature at the .05 significance
level.

c. Test whether feed rate impacts drill bit temperature at the .05 significance level.


50. A particular county employs three assessors
who are responsible for determining the value of residential property in the county.
To see whether these assessors differ systematically in their assessments, 5 houses are
selected, and each assessor is asked to determine the market value of each house.
With factor A denoting assessors (I = 3) and factor B denoting houses (J = 5), suppose
SSA = 11.7, SSB = 113.5, and SSE = 25.6.

a. Test $H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$ at level .05. ($H_0$ states that there are no systematic
differences among assessors.)

b. Explain why a randomized block experiment with only 5 houses was used
rather than a one-way ANOVA experiment involving a total of 15 different
houses with each assessor asked to assess 5 different houses (a different
group of 5 for each assessor).


51. In a 2018 class activity, 54 students measured how much time (sec) it took to melt
each of the following in their mouths: (1) a butterscotch chip, (2) a chocolate chip, (3) a
white chip (yes, white is a chip flavor). Each student rolled a die to determine the
order in which to melt the chips.

a. Why was it important to randomize the chip order?

b. Besides not having to recruit as many students to get the same number of
observations, what is the advantage of using blocking here, versus randomly
assigning one chip to each student?

c. Summary quantities include $\bar{x}_{1\cdot} = 88.15$, $\bar{x}_{2\cdot} = 60.49$, $\bar{x}_{3\cdot} = 72.35$, SS(Subject) =
135,833, and SSE = 31,506. Construct an ANOVA table and test at significance
level .01 to see whether mean melting time varies by chip flavor.

d. Judging from the F ratio for subjects, do you think that blocking on subjects was
effective in this experiment? Explain.


52. The efficiency (%) of a 0.7 L Daihatsu diesel engine at 3100 rpm was determined
at various torques (N-m) and coolant temperatures (°C), resulting in the following
data kindly provided by the study’s authors (“Performance of a Diesel Engine at High
Coolant Temperatures,” J. Energy Resour. Technol. 2017).

                            Coolant Temp. (°C)
                 90      100      125      150      175      200
          12    24.07    23.81    23.55    22.84    24.34    24.62
          15    26.52    26.00    26.06    25.33    26.92    27.08
Torque    18    28.50    28.23    26.67    26.42    28.94    28.74
          21    28.61    29.96    28.38    27.16    29.35    29.16
          24    28.47    28.34    28.20    26.55    28.87    27.27

a. Test at the .01 significance level whether mean engine efficiency differs with torque.

b. Test at the .01 significance level whether mean engine efficiency differs with
coolant temperature.

c. Apply Tukey’s procedure as appropriate to the results of (a) and (b).


53. An experiment was conducted to assess the effect of current and voltage on the tensile
strength (ksi) of welds made using a tungsten inert gas (TIG) welding tool, which
resulted in the following data.

                        Voltage
                  10        12        14
           130   197.62    200.35    199.40
Current    135   185.90    215.56    179.36
           140   203.23    174.81    194.47

Use two-factor ANOVA to determine whether current or voltage impacts the tensile
strength of welds under these experimental conditions at the .10 significance level. (Data
is from “To Investigate the [E]ffect of Process Parameters on Mechanical Properties of TIG
Welded 6351 Aluminum Alloy by ANOVA,” GE-Int. J. Engr. Res. 2014: 50-62.)


54. The article “Effect of Face Value on Product Valuation in Foreign Currencies”
(J. Consum. Res. 2002) describes a class- 
room experiment involving 97 business 
students to see whether they could adjust 
for exchange rates when deciding how 
much to spend a product. The students 
acted as buyers in a mock World Garment 
Expo at which each would be purchasing 
silk ties from six different nations. They 
were provided pictures of the ties and the 
exchange rates for each national currency 
into US$; the pictures were randomly per- 
muted to reduce any perceived quality dif- 
ferences. Students then had to report how 
much, in the foreign currencies, they would 
pay for one silk tie. The average prices 
students were willing to pay, converted 
back into dollars, appear in the accompa- 
nying table; exchange rates are the number 
of foreign currency units (e.g., Norwegian 
krone or Japanese yen) equaling $1. 


Country (Exch. rate) Mean SD 

Norway (9.5) $15.85 $15.36 
Luxembourg (48) $15.45 $13.47 
Japan (110) $15.91 $11.26 
Korea (1100) $12.42 $10.17 
Romania (24,500) $11.33 $12.05 
Turkey (685,000) $10.77 $9.01 
Relevant sums of squares include SS(Country) = 2752, SSE = 45,086, and
SST = 86,653.

a. What experimental design was used in this study? What was the advantage of
using this method?

b. Construct an ANOVA table from the information provided, then test the null
hypothesis of “no currency exchange effect” at the .01 level.

c. The exchange rates vary by orders of magnitude (this was deliberate). As the
exchange rate increases, what happens to the average amount students are
willing to pay for a silk tie in that currency? (The article’s authors note that
“th[is] evidence is against the common wisdom that when the home currency is
perceived to go a long way in foreign currency terms, the foreign currency
will be treated as play money and that people will overspend.”)

55. The strength of concrete used in commercial construction tends to vary from one
batch to another. Consequently, small test 
cylinders of concrete sampled from a batch 
are “cured” for periods up to about 28 days 
in temperature- and moisture-controlled 
environments before strength measure- 
ments are made. Concrete is then “bought 
and sold on the basis of strength test 
cylinders” (ASTM C 31 Standard Test 
Method for Making and Curing Concrete 
Test Specimens in the Field). The accom- 
panying data resulted from an experiment 
carried out to compare three different curing 
methods with respect to compressive 
strength (MPa). Analyze this data. 


Batch   Method A   Method B   Method C
1       30.7       33.7       30.5
2       29.1       30.6       32.6
3       30.0       32.2       30.5
4       31.9       34.6       33.5
5       30.5       33.0       32.4
6       26.9       29.3       27.8
7       28.2       28.4       30.7
8       32.4       32.4       33.6
9       26.6       29.5       29.2
10      28.6       29.4       33.2

56. Check the normality and constant variance assumptions graphically for the data of
Example 11.17.

57. Suppose that in the experiment described in Exercise 50 the five houses had actually
been selected at random from among those of a certain age and size, so that factor B is
random rather than fixed. Test $H_0: \sigma_B^2 = 0$ versus $H_a: \sigma_B^2 > 0$ using a
level .01 test.

58. a. Show that a constant d can be added to (or subtracted from) each $x_{ij}$ without
affecting any of the ANOVA sums of squares.

b. Suppose that each $x_{ij}$ is multiplied by a nonzero constant c. How does this affect
the ANOVA sums of squares? How does this affect the values of the F statistics $F_A$ and
$F_B$? What effect does “coding” the data by $y_{ij} = cx_{ij} + d$ have on the
conclusions resulting from the ANOVA procedures?

59. Use the fact that $E(X_{ij}) = \mu + \alpha_i + \beta_j$ with
$\sum \alpha_i = \sum \beta_j = 0$ to show that
$E(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}) = \alpha_i$, so that
$\hat{\alpha}_i = \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}$ is an unbiased estimator for $\alpha_i$.

60. Power for the F test in two-factor ANOVA is calculated using a similar method to the
one shown in Section 11.3. For fixed values of $\alpha_1, \alpha_2, \ldots, \alpha_I$, power
calculations are based on the noncentral F distributions with parameters $\nu_1 = I - 1$,
$\nu_2 = (I - 1)(J - 1)$, and noncentrality parameter $\lambda = J\sum \alpha_i^2/\sigma^2$.

a. For the corrosion experiment described in Exercise 46, determine power when
$\alpha_1 = 4$, $\alpha_2 = 0$, $\alpha_3 = \alpha_4 = -2$, and $\sigma = 4$. Repeat for
$\alpha_1 = 6$, $\alpha_2 = 0$, $\alpha_3 = \alpha_4 = -3$, and $\sigma = 4$.

b. By symmetry, what is the power for the test of $H_{0B}$ versus $H_{aB}$ in Example 11.13
when $\beta_1 = .3$, $\beta_2 = \beta_3 = \beta_4 = -.1$, and $\sigma = .3$?

11.5 Two-Factor ANOVA with Replication


In Section 11.4, we analyzed data from a two-factor experiment in which there was one observation
for each of the $IJ$ combinations of levels of the two factors (i.e., no replication). To obtain valid test
procedures in that situation, the $\mu_{ij}$’s were assumed to have an additive structure, meaning that the
difference in true average responses for any two levels of one factor is the same for each level of the
other factor. This was shown in Figure 11.4a, in which the lines connecting true average responses
are parallel.

Figure 11.4b depicted a set of true average responses that does not have additive structure. The
lines connecting these $\mu_{ij}$’s are not parallel, which means that the difference in true mean responses
for different levels of one factor does depend on the level of the other factor. This dependence is known
as an interaction. When $K_{ij} > 1$ for at least one (i, j) pair, we can estimate the interaction effect and
formally test for whether interaction is present.

In specifying the appropriate model and deriving test procedures, we will focus on the case
$K_{ij} = K > 1$, so the number of observations per “cell” (i.e., for each combination of levels) is constant.
That is, throughout this section we will assume a balanced study design.


A Two-Factor Model With Interaction 

Again $\mu_{ij}$ will denote the true mean response when factor A is at level i and factor B is at level j.
Expressions (11.7)-(11.9) show the development of a model equation that assumes additivity (i.e., no
interaction). To extend this model, first let

$$\mu = \frac{1}{IJ}\sum_i \sum_j \mu_{ij} \qquad \bar{\mu}_{i\cdot} = \frac{1}{J}\sum_j \mu_{ij} \qquad \bar{\mu}_{\cdot j} = \frac{1}{I}\sum_i \mu_{ij} \qquad (11.10)$$

Thus $\mu$ is the expected response averaged over all levels of both factors (the true grand mean),
$\bar{\mu}_{i\cdot}$ is the expected response averaged over levels of factor B when factor A is held at level i,
and similarly for $\bar{\mu}_{\cdot j}$. Now define three sets of parameters by

$$\alpha_i = \bar{\mu}_{i\cdot} - \mu = \text{the effect of factor A at level } i$$
$$\beta_j = \bar{\mu}_{\cdot j} - \mu = \text{the effect of factor B at level } j \qquad (11.11)$$
$$\gamma_{ij} = \mu_{ij} - (\mu + \alpha_i + \beta_j)$$

from which

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$$

The $\alpha_i$’s and $\beta_j$’s are the same as those from Section 11.4. The $\alpha_i$’s are called the main effects for
factor A, and the $\beta_j$’s are the main effects for factor B. The new parameters, the $\gamma_{ij}$’s, measure the
difference between the true treatment means $\mu_{ij}$ and the means assumed under the additive model
(11.8). The $\gamma_{ij}$’s are referred to as the interaction parameters, and the model is additive if and only if
all $\gamma_{ij}$’s = 0.
Although there are $I$ $\alpha_i$’s, $J$ $\beta_j$’s, and $IJ$ $\gamma_{ij}$’s in addition to $\mu$, the conditions
$\sum_i \alpha_i = 0$, $\sum_j \beta_j = 0$, $\sum_j \gamma_{ij} = 0$ for every i, and $\sum_i \gamma_{ij} = 0$ for every j
(all true by virtue of (11.10) and (11.11)) imply that only $IJ$ of these new parameters are independently
determined: $\mu$, $I - 1$ of the $\alpha_i$’s, $J - 1$ of the $\beta_j$’s, and $(I - 1)(J - 1)$ of the $\gamma_{ij}$’s.
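
As a quick check of this count (a worked tally added here for illustration, not a new result), the numbers of freely determined parameters add up to the number of cells:

```latex
1 + (I-1) + (J-1) + (I-1)(J-1)
  = 1 + (I-1) + (J-1) + (IJ - I - J + 1)
  = IJ
```

For instance, with $I = 3$ and $J = 4$ the tally is $1 + 2 + 3 + 6 = 12$, matching the 12 cell means $\mu_{ij}$.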

We now must use triple subscripts for both random variables and observed values: $X_{ijk}$ and $x_{ijk}$
denote the kth observation (replication) when factor A is at level i and factor B is at level j.


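TWO-FACTOR ANOVA GENERAL MODEL EQUATION

$$X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk} \qquad (11.12)$$
$$i = 1, \ldots, I; \quad j = 1, \ldots, J; \quad k = 1, \ldots, K$$

where the $\varepsilon_{ijk}$’s are independent and normally distributed, each with
mean 0 and variance $\sigma^2$.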


Example 11.18 Three different varieties of tomato (Harvester, Ife No. 1, and Pusa Early Dwarf) and 
four different plant densities (10, 20, 30, and 40 thousand plants per hectare) are being considered for 
planting in a particular region. To see whether either variety or plant density affects yield, each 
combination of variety and plant density is used in three different plots, resulting in the data on yields 
in Table 11.10. 


Table 11.10 Yield data for Example 11.18

                                   Planting density
Variety   10,000              20,000              30,000              40,000
H         10.5   9.2   7.9    12.8  11.2  13.3    12.1  12.6  14.0    10.8   9.1  12.5
Ife        8.1   8.6  10.1    12.7  13.7  11.5    14.4  15.4  13.7    11.3  12.5  14.5
P         16.1  15.3  17.5    16.6  19.2  18.5    20.8  18.0  21.0    18.4  18.9  17.2


Here, $I = 3$, $J = 4$, and $K = 3$, for a total of $IJK = 36$ observations. If we identify factor A = variety
and B = density, then the observations across the first row of Table 11.10 are $x_{111} = 10.5$, $x_{112} = 9.2$,
$x_{113} = 7.9$, $x_{121} = 12.8$, and so on. Some of the parameters specified in the model equation (11.12)
include

$\mu$ = true average yield of all tomato plants in this population
$\bar{\mu}_{1\cdot}$ = true average yield of all Harvester plants (i = 1) in this population
$\beta_2$ = the effect of 20,000 plants/hectare density (j = 2) on average yield

To check the normality and constant variance assumptions, we can make plots similar to those of
Section 11.4. Define the predicted/fitted values to be the cell means, $\hat{x}_{ijk} = \bar{x}_{ij\cdot}$, so the residuals are
$e_{ijk} = x_{ijk} - \hat{x}_{ijk} = x_{ijk} - \bar{x}_{ij\cdot}$. For example, the mean of the three observations in the top-left cell of
Table 11.10 is $\bar{x}_{11\cdot} = (10.5 + 9.2 + 7.9)/3 = 9.2$, and the residual for the very first observation is
$e_{111} = x_{111} - \bar{x}_{11\cdot} = 10.5 - 9.2 = 1.3$. The normal probability plot of the 36 residuals is Figure 11.8a,
and the plot of the residuals against the fitted values is Figure 11.8b. The normal plot is sufficiently
straight that there should be no concern about the normality assumption. The plot of residuals against
predicted values has a fairly uniform vertical spread, so there is no cause for concern about the
constant variance assumption.
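
The fitted values and residuals described above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python (the text itself uses Minitab); the dictionary layout and the names `data`, `fitted`, and `residuals` are illustrative choices, not anything prescribed by the text.

```python
# Yields from Table 11.10: data[variety][density index] holds the K = 3 replicates.
data = {
    "H":   [[10.5, 9.2, 7.9], [12.8, 11.2, 13.3], [12.1, 12.6, 14.0], [10.8, 9.1, 12.5]],
    "Ife": [[8.1, 8.6, 10.1], [12.7, 13.7, 11.5], [14.4, 15.4, 13.7], [11.3, 12.5, 14.5]],
    "P":   [[16.1, 15.3, 17.5], [16.6, 19.2, 18.5], [20.8, 18.0, 21.0], [18.4, 18.9, 17.2]],
}


def cell_mean(cell):
    """Average of the replicates in one (variety, density) cell."""
    return sum(cell) / len(cell)


# Fitted value for every observation in a cell is that cell's mean.
fitted = {v: [cell_mean(c) for c in cells] for v, cells in data.items()}

# Residual = observation minus its cell mean.
residuals = {
    v: [[x - cell_mean(c) for x in c] for c in cells] for v, cells in data.items()
}

print(round(fitted["H"][0], 2))        # fitted value for the top-left cell
print(round(residuals["H"][0][0], 2))  # residual for the very first observation
```

The 36 residuals produced this way are what one would pass to a normal probability plot or plot against the fitted values.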


[Figure 11.8: (a) Normal Probability Plot of the Residuals (response is Yield), Percent versus Residual; (b) Residuals Versus the Fitted Values (response is Yield), Residual versus Fitted Value]

Figure 11.8 Plots from Minitab to verify assumptions for Example 11.18


Sums of Squares and Test Procedures 
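There are now three relevant pairs of hypotheses:

$H_{0AB}$: $\gamma_{ij} = 0$ for all i, j                     versus   $H_{aAB}$: at least one $\gamma_{ij} \ne 0$
$H_{0A}$: $\alpha_1 = \alpha_2 = \cdots = \alpha_I = 0$      versus   $H_{aA}$: at least one $\alpha_i \ne 0$
$H_{0B}$: $\beta_1 = \beta_2 = \cdots = \beta_J = 0$         versus   $H_{aB}$: at least one $\beta_j \ne 0$
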


The no-interaction hypothesis $H_{0AB}$ is usually tested first. If $H_{0AB}$ is not rejected, then the other two
hypotheses can be tested to see whether the main effects are significant. But once $H_{0AB}$ is rejected, we
believe that the effect of factor A at any particular level depends on the level of B (and vice versa). It
then does not make sense to test $H_{0A}$ or $H_{0B}$. In this case, an interaction plot similar to that of
Figure 11.4b is helpful in visualizing the way the factors interact.

To test the hypotheses of interest, we again define sums of squares and indicate their corresponding
degrees of freedom. Again, a dot in place of a subscript means that we have summed over
all values of that subscript, and a horizontal bar denotes averaging. So, for example, $\bar{X}_{ij\cdot}$ denotes the
mean of the K observations in the (i, j)th cell of the data table, while $\bar{X}_{i\cdot\cdot}$ represents the average of all
JK values in the ith row.


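$$SSA = \sum_i \sum_j \sum_k (\bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 \qquad df = I - 1$$
$$SSB = \sum_i \sum_j \sum_k (\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 \qquad df = J - 1$$
$$SSAB = \sum_i \sum_j \sum_k (\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot})^2 \qquad df = (I - 1)(J - 1)$$
$$SSE = \sum_i \sum_j \sum_k (X_{ijk} - \bar{X}_{ij\cdot})^2 \qquad df = IJK - IJ$$
$$SST = \sum_i \sum_j \sum_k (X_{ijk} - \bar{X}_{\cdot\cdot\cdot})^2 \qquad df = IJK - 1$$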
SSAB is called the interaction sum of squares. Mean squares are, as always,
defined by (sum of squares)/df.
The fundamental ANOVA identity is

$$SST = SSA + SSB + SSAB + SSE$$

According to the fundamental identity, variation is partitioned into four pieces: unexplained
(SSE, which would be present whether or not any of the three null hypotheses was true) and three
pieces that may be explained by the truth or falsity of the three $H_0$’s.

The expected mean squares suggest how each set of hypotheses should be tested using the 
appropriate ratio of mean squares with MSE in the denominator: 


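$$E(MSE) = \sigma^2$$
$$E(MSA) = \sigma^2 + \frac{JK}{I - 1}\sum_{i=1}^{I} \alpha_i^2$$
$$E(MSB) = \sigma^2 + \frac{IK}{J - 1}\sum_{j=1}^{J} \beta_j^2$$
$$E(MSAB) = \sigma^2 + \frac{K}{(I - 1)(J - 1)}\sum_{i=1}^{I}\sum_{j=1}^{J} \gamma_{ij}^2$$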


Each of the three mean-square ratios can be shown to have an F distribution when the associated $H_0$ is
true, which yields the following level $\alpha$ test procedures.


Hypotheses                       Test Statistic Value       Rejection Region
$H_{0A}$ versus $H_{aA}$         $f_A = MSA/MSE$            $f_A \ge F_{\alpha,\,I-1,\,IJ(K-1)}$
$H_{0B}$ versus $H_{aB}$         $f_B = MSB/MSE$            $f_B \ge F_{\alpha,\,J-1,\,IJ(K-1)}$
$H_{0AB}$ versus $H_{aAB}$       $f_{AB} = MSAB/MSE$        $f_{AB} \ge F_{\alpha,\,(I-1)(J-1),\,IJ(K-1)}$


As before, the results of the analysis are summarized in an ANOVA table. 


Example 11.19 (Example 11.18 continued) The cell, row, column, and grand means for the given
data are

                             Planting density
Variety                      10,000   20,000   30,000   40,000   $\bar{x}_{i\cdot\cdot}$
H                             9.20    12.43    12.90    10.80    11.33
Ife                           8.93    12.63    14.50    12.77    12.21
P                            16.30    18.10    19.93    18.17    18.13
$\bar{x}_{\cdot j\cdot}$     11.48    14.39    15.78    13.91    $\bar{x}_{\cdot\cdot\cdot} = 13.89$


Table 11.11 summarizes the resulting ANOVA computations. 
Table 11.11 ANOVA table for Example 11.19

Source of variation   df   Sum of squares   Mean square   f                    P-value
Varieties              2   327.60           163.80        $f_A$ = 103.02       <.0001
Density                3    86.69            28.90        $f_B$ = 18.18        <.0001
Interaction            6     8.03             1.34        $f_{AB}$ = .84       .551
Error                 24    38.04             1.59
Total                 35   460.36
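
The sums of squares in the ANOVA table can be recomputed directly from the definitions. The sketch below uses plain Python on the Table 11.10 yields; the nested-list layout and variable names are illustrative assumptions, and the code is a check on the hand formulas rather than a replacement for statistical software.

```python
# Table 11.10 yields: x[i][j] is the list of K replicates for variety i, density j.
x = [
    [[10.5, 9.2, 7.9], [12.8, 11.2, 13.3], [12.1, 12.6, 14.0], [10.8, 9.1, 12.5]],   # H
    [[8.1, 8.6, 10.1], [12.7, 13.7, 11.5], [14.4, 15.4, 13.7], [11.3, 12.5, 14.5]],  # Ife
    [[16.1, 15.3, 17.5], [16.6, 19.2, 18.5], [20.8, 18.0, 21.0], [18.4, 18.9, 17.2]],  # P
]
I, J, K = len(x), len(x[0]), len(x[0][0])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


grand = mean(v for row in x for c in row for v in c)
row_mean = [mean(v for c in row for v in c) for row in x]
col_mean = [mean(v for i in range(I) for v in x[i][j]) for j in range(J)]
cell = [[mean(c) for c in row] for row in x]

# Sums of squares, following the definitions of SSA, SSB, SSAB, SSE, and SST.
ssa = J * K * sum((m - grand) ** 2 for m in row_mean)
ssb = I * K * sum((m - grand) ** 2 for m in col_mean)
ssab = K * sum(
    (cell[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
    for i in range(I) for j in range(J)
)
sse = sum((v - cell[i][j]) ** 2 for i in range(I) for j in range(J) for v in x[i][j])
sst = sum((v - grand) ** 2 for row in x for c in row for v in c)

# Fixed-effects F ratios: every mean square is divided by MSE.
mse = sse / (I * J * (K - 1))
f_a = (ssa / (I - 1)) / mse
f_b = (ssb / (J - 1)) / mse
f_ab = (ssab / ((I - 1) * (J - 1))) / mse

print(round(ssa, 2), round(ssb, 2), round(ssab, 2), round(sse, 2), round(sst, 2))
```

Because the table rounds MSE to 1.59 before dividing, F ratios computed from unrounded mean squares can differ slightly from the tabled values in their last digits; the conclusions are unaffected.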


Since $f_{AB} = .84 < F_{.01,6,24} = 3.67$, $H_{0AB}$ cannot be rejected at level .01, so we conclude that the
interaction effects are not significant. Now the presence or absence of main effects can be investigated.
Since $f_A = 103.02 \ge F_{.01,2,24} = 5.61$, $H_{0A}$ is rejected at level .01 in favor of the conclusion
that different varieties do affect the true average yields. Similarly, $f_B = 18.18 \ge 4.72 = F_{.01,3,24}$, so
we conclude that true average yield also depends on plant density.

Figure 11.9 shows the interaction plot. Notice the nearly parallel lines for the three tomato vari- 
eties, in agreement with the F test showing no significant interaction. The yield for Pusa Early Dwarf 
appears to be significantly above the yields for the other two varieties, and this is in accord with the 
highly significant F for varieties. Furthermore, all three varieties show the same pattern in which yield 
increases as the density goes up, but decreases beyond 30,000 per hectare. This suggests that planting 
more seed will increase the yield, but eventually overcrowding causes the yield to drop. 


[Figure 11.9: interaction plot with Density (10000, 20000, 30000, 40000) on the horizontal axis and one line per Variety (H, Ife, P)]

Figure 11.9 Interaction plot for the tomato yield data


Multiple Comparisons 

When the no-interaction hypothesis $H_{0AB}$ is not rejected and at least one of the two main-effect null
hypotheses is rejected, Tukey’s method can be used to identify significant differences in levels. To
identify differences among the $\alpha_i$’s when $H_{0A}$ is rejected:


1. Obtain $Q_{\alpha,\,I,\,IJ(K-1)}$; the second subscript I identifies the number of levels being compared and the
third subscript refers to the error df.

2. Compute $d = Q \cdot \sqrt{MSE/(JK)}$; JK is the number of observations averaged to obtain each of the
$\bar{x}_{i\cdot\cdot}$’s to be compared in step 3.

3. Order the $\bar{x}_{i\cdot\cdot}$’s from smallest to largest and, as before, underscore all pairs that differ by less than
d. Pairs not underscored correspond to significantly different levels of factor A.


To identify different levels of factor B when $H_{0B}$ is rejected, replace the second subscript in Q by J,
replace JK by IK in d, and replace $\bar{x}_{i\cdot\cdot}$ by $\bar{x}_{\cdot j\cdot}$.


Example 11.20 (Example 11.19 continued) For factor A (varieties), $I = 3$, so with $\alpha = .01$ and
$IJ(K - 1) = 24$, $Q_{.01,3,24} = 4.55$. Then $d = 4.55\sqrt{1.59/12} = 1.66$, so ordering and underscoring give

$\bar{x}_{1\cdot\cdot}$ = 11.33     $\bar{x}_{2\cdot\cdot}$ = 12.21     $\bar{x}_{3\cdot\cdot}$ = 18.13
(underscored together: 11.33 and 12.21)

The Harvester and Ife varieties do not differ significantly from each other in effect on true average
yield, but both differ from the Pusa variety.

For factor B (density), $J = 4$ so $Q_{.01,4,24} = 4.91$ and $d = 4.91\sqrt{1.59/9} = 2.06$.

$\bar{x}_{\cdot 1\cdot}$ = 11.48     $\bar{x}_{\cdot 4\cdot}$ = 13.91     $\bar{x}_{\cdot 2\cdot}$ = 14.39     $\bar{x}_{\cdot 3\cdot}$ = 15.78
(underscored together: 13.91, 14.39, and 15.78)

Thus with experimentwise error rate .01, only the lowest density differs significantly from all others.
Even with $\alpha = .05$ (so that $d = 1.64$), densities 2 and 3 cannot be judged significantly different from
each other in their effect on yield.
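
The three Tukey steps can be sketched in code as follows. This is a minimal illustration in plain Python: the helper name `tukey_pairs` is hypothetical, and the critical values 4.55 and 4.91 are simply copied from the example (they come from a Studentized range table, not from the code).

```python
from itertools import combinations
from math import sqrt


def tukey_pairs(means, q_crit, mse, n_per_mean):
    """Return (d, pairs): d = Q * sqrt(MSE / n) and the label pairs whose
    sample means differ by at least d (the pairs that would NOT be underscored)."""
    d = q_crit * sqrt(mse / n_per_mean)
    ordered = sorted(means, key=means.get)  # labels from smallest to largest mean
    pairs = [(a, b) for a, b in combinations(ordered, 2)
             if abs(means[a] - means[b]) >= d]
    return d, pairs


mse = 1.59  # from Table 11.11
variety_means = {"H": 11.33, "Ife": 12.21, "P": 18.13}
density_means = {"10,000": 11.48, "20,000": 14.39, "30,000": 15.78, "40,000": 13.91}

# Factor A: each row mean averages JK = 12 observations; Q_{.01,3,24} = 4.55.
d_a, pairs_a = tukey_pairs(variety_means, q_crit=4.55, mse=mse, n_per_mean=12)
# Factor B: each column mean averages IK = 9 observations; Q_{.01,4,24} = 4.91.
d_b, pairs_b = tukey_pairs(density_means, q_crit=4.91, mse=mse, n_per_mean=9)

print(round(d_a, 2), pairs_a)
print(round(d_b, 2), pairs_b)
```

Every pair returned for factor B involves the 10,000 density, which is the code's version of the statement that only the lowest density differs significantly from all the others.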


Models with Mixed and Random Effects 

In some situations, the levels of either factor may have been chosen from a large population of 
possible levels, so that the effects contributed by the factor are random rather than fixed. As in 
Section 11.4, if both factors contribute random effects, the model is referred to as a random effects 
model, whereas if one factor is fixed and the other is random, a mixed effects model results. We will 
now consider the analysis for a mixed effects model in which factor A (rows) is the fixed factor and 
factor B (columns) is the random factor; the case in which both factors are random is dealt with in 
Exercise 73. When either factor is random, interaction effects will also be random, and the mixed 
effects model is 


$$X_{ijk} = \mu + \alpha_i + B_j + G_{ij} + \varepsilon_{ijk}$$
$$i = 1, \ldots, I; \quad j = 1, \ldots, J; \quad k = 1, \ldots, K$$

Here $\mu$ and the $\alpha_i$’s are constants with $\sum \alpha_i = 0$ and the $B_j$’s, $G_{ij}$’s, and $\varepsilon_{ijk}$’s are independent,
normally distributed random variables with expected value 0 and variances $\sigma_B^2$, $\sigma_G^2$, and $\sigma^2$,
respectively.² The three hypotheses of interest are

$H_{0A}$: $\alpha_1 = \cdots = \alpha_I = 0$   versus   $H_{aA}$: at least one $\alpha_i \ne 0$
$H_{0B}$: $\sigma_B^2 = 0$                     versus   $H_{aB}$: $\sigma_B^2 > 0$
$H_{0G}$: $\sigma_G^2 = 0$                     versus   $H_{aG}$: $\sigma_G^2 > 0$

It is customary to test $H_{0A}$ and $H_{0B}$ only if the no-interaction hypothesis $H_{0G}$ cannot be rejected.
The relevant sums of squares and mean squares needed for the test procedures are defined and
computed exactly as in the fixed effects case. The expected mean squares for the mixed model are


$$E(MSE) = \sigma^2$$
$$E(MSA) = \sigma^2 + K\sigma_G^2 + \frac{JK}{I - 1}\sum \alpha_i^2$$
$$E(MSB) = \sigma^2 + K\sigma_G^2 + IK\sigma_B^2$$
$$E(MSAB) = \sigma^2 + K\sigma_G^2$$


Thus, to test the no-interaction hypothesis, the ratio $f_{AB} = MSAB/MSE$ is again appropriate, with
$H_{0G}$ rejected if $f_{AB} \ge F_{\alpha,\,(I-1)(J-1),\,IJ(K-1)}$. However, for testing $H_{0A}$ versus $H_{aA}$, the expected mean
squares suggest that although the numerator of the F ratio should still be MSA, the denominator
should be MSAB rather than MSE. MSAB is also the denominator of the F ratio for testing $H_{0B}$.
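
One way to see why these denominators are natural (a short check using only the mixed-model expected mean squares, added here for illustration) is to set each null hypothesis to be true and compare numerator and denominator:

```latex
\text{Under } H_{0A}\ (\text{all } \alpha_i = 0): \quad E(MSA)  = \sigma^2 + K\sigma_G^2 = E(MSAB)
\text{Under } H_{0B}\ (\sigma_B^2 = 0):           \quad E(MSB)  = \sigma^2 + K\sigma_G^2 = E(MSAB)
\text{Under } H_{0G}\ (\sigma_G^2 = 0):           \quad E(MSAB) = \sigma^2 = E(MSE)
```

In each case the two mean squares in the ratio have the same expectation when the null hypothesis holds, and the numerator's expectation is larger when it fails.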


For testing $H_{0A}$ versus $H_{aA}$ (factor A fixed, B random), the test statistic value is
$f_A = MSA/MSAB$, and the rejection region is $f_A \ge F_{\alpha,\,I-1,\,(I-1)(J-1)}$. The test of $H_{0B}$
versus $H_{aB}$ utilizes $f_B = MSB/MSAB$, and the rejection region is $f_B \ge F_{\alpha,\,J-1,\,(I-1)(J-1)}$.


Example 11.21 A process engineer has identified two potential causes of electric motor vibration, 
the material used for the motor casing (factor A) and the supply source of bearings used in the motor 
(factor B). The accompanying data on the amount of vibration (microns) resulted from an experiment 
in which motors with casings made of steel, aluminum, and plastic were constructed using bearings 
supplied by five randomly selected sources. 


                                    Supply source
Material    1             2             3             4             5
Steel       13.1  13.2    16.3  15.8    13.7  14.3    15.7  15.8    13.5  12.5
Aluminum    15.0  14.8    15.7  16.4    13.9  14.3    13.7  14.2    13.4  13.8
Plastic     14.0  14.3    17.2  16.7    12.4  12.3    14.4  13.9    13.2  13.1


²This is referred to as an “unrestricted” model. An alternative “restricted” model requires that
$\sum_i G_{ij} = 0$, so the $G_{ij}$’s are no longer independent. Expected mean squares and F ratios
appropriate for testing certain hypotheses depend on the choice of model.
Only the three casing materials used in the experiment are under consideration for use in production,
so factor A is fixed. However, the five supply sources were randomly selected from a much
larger population, so factor B is random. The relevant null hypotheses are

$H_{0A}$: $\alpha_1 = \alpha_2 = \alpha_3 = 0$      $H_{0B}$: $\sigma_B^2 = 0$      $H_{0G}$: $\sigma_G^2 = 0$

Minitab output appears in Figure 11.10. Notice that $f_A = 0.24 = 0.3523/1.4507 = MSA/MSAB$, not
MSA/MSE as in a test with all fixed effects.


Factor Information

Factor    Type    Levels  Values
Material  Fixed        3  Aluminum, Plastic, Steel
Supplier  Random       5  1, 2, 3, 4, 5

Analysis of Variance

Source             DF       SS      MS  F-Value  P-Value
Material            2   0.7047  0.3523     0.24    0.790
Supplier            4  36.6747  9.1687     6.32    0.013
Material*Supplier   8  11.6053  1.4507    13.03    0.000
Error              15   1.6700  0.1113
Total              29  50.6547


Figure 11.10 Minitab output for the data of Example 11.21 
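
The F-Values in Figure 11.10 can be reproduced from the mean squares alone. The following sketch (plain Python; the dictionary names are illustrative) applies the mixed-model rule that both main effects are tested against MSAB while the interaction is tested against MSE, and also shows, for contrast, the ratio an all-fixed analysis would have used for Material.

```python
# Mean squares copied from the Minitab output in Figure 11.10.
ms = {"A": 0.3523, "B": 9.1687, "AB": 1.4507, "E": 0.1113}

# Mixed model (A fixed, B random): main effects are compared with MSAB.
f_a = ms["A"] / ms["AB"]    # Material
f_b = ms["B"] / ms["AB"]    # Supplier
f_ab = ms["AB"] / ms["E"]   # Material*Supplier interaction, compared with MSE

# For contrast only: the ratio that a fixed-effects analysis would report for A.
f_a_fixed = ms["A"] / ms["E"]

print(round(f_a, 2), round(f_b, 2), round(f_ab, 2), round(f_a_fixed, 2))
```

The gap between `f_a` and `f_a_fixed` shows how much the choice of denominator matters when the interaction mean square is much larger than MSE.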


The included 0.000 P-value for interaction means that it is less than .0005 (the actual value is 
.000018). To interpret the significant interaction we use the interaction plot, Figure 11.11, which has 
both versions, one with source on the x-axis and one with material on the x-axis. Interaction is 
evident, because the best material (the one with the least vibration) depends strongly on source. For 
source 1 the best material is steel, for source 3 the best material is plastic, and for source 4 the best 
material is aluminum. Because of this interaction, we ordinarily would not interpret the main effects, 
but one cannot help noticing that there is strong dependence of vibration on source. Source 2 is bad 
for all three materials and source 3 is pretty good for all three materials. When one-way ANOVA 
analyses are done to compare the five sources for each of the three materials, all three show highly 
significant differences. This is consistent with the P-value of 0.013 for supplier in Figure 11.10. We 
can conclude that, although the interaction causes the best material to depend on the supply source, 
the source also makes a difference of its own. 
Figure 11.11 Minitab interaction plot for the data of Example 11.21


Final Comments on Two-Factor ANOVA 

Power and sample size calculations for two-factor ANOVA are even more unwieldy than those for the 
single-factor case—software is essential for such computations. The pwr2 package in R includes 
graphical and computational tools for balanced two-way designs, and PROC GLMPOWER in SAS 
has similar functionality. 

When at least two of the $K_{ij}$’s are unequal, the ANOVA computations are much more complex
than for the case $K_{ij} = K$, and there are no nice formulas for the appropriate test statistics. Most
software packages analyze unbalanced data by using a broader framework called the general linear 
model, which also encompasses the methods of Chapter 12. The references by Kutner et al., Miller, 
Montgomery, or Ott and Longnecker in the bibliography can be consulted for more information. 


Exercises: Section 11.5 (61-73) 


61. In an experiment to assess the effects of curing time (factor A) and type of mix (factor
B) on the compressive strength of hardened cement cubes, three different curing times
were used in combination with four different mixes, with three observations obtained for
each of the 12 curing time-mix combinations. The resulting sums of squares were
computed to be SSA = 30,763.0, SSB = 34,185.6, SSE = 97,436.8, and SST = 205,966.6.

a. Construct an ANOVA table.

b. Test at level .05 the null hypothesis $H_{0AB}$: all $\gamma_{ij}$’s = 0 (no interaction of factors)
against $H_{aAB}$: at least one $\gamma_{ij} \ne 0$.

c. Test at level .05 the null hypothesis $H_{0A}$: $\alpha_1 = \alpha_2 = \alpha_3 = 0$ (factor A main effects
are absent) against $H_{aA}$: at least one $\alpha_i \ne 0$.

d. Test $H_{0B}$: $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$ versus $H_{aB}$: at least one $\beta_j \ne 0$ using a
level .05 test.

e. The values of the $\bar{x}_{i\cdot\cdot}$’s were $\bar{x}_{1\cdot\cdot} = 4010.88$, $\bar{x}_{2\cdot\cdot} = 4029.10$, and
$\bar{x}_{3\cdot\cdot} = 3960.02$. Use Tukey’s procedure to investigate significant differences among the three
curing times.


62. In an experiment described in the article “The Impact of SMS Advertising on
Members of a Virtual Community” (J. Advert. Res. 2008: 363-374), researchers
worked with an online gaming forum to send out messages advertising a deal on
Subway sandwiches. The message was varied in two ways: the apparent
spokesperson (either Subway or “Nik,” a made-up forum member) and the language
used (full English or texting shorthand). Forum members were randomly assigned to
receive one of the four possible messages (spokesperson-language pairs); recipients
were later surveyed about their views on the advertisement (attractiveness, credibility,
etc.). Assume the researchers obtained K = 12 survey responses for each message (which
approximates the actual study results).


a. Credibility ratings were translated into 
normalized scores for each participant, 
with positive values meaning the mes- 
sage was more credible (zero neutral, 
negative less credible). Mean normal- 
ized credibility scores appear below. 


Credibility   Full English   Texting shorthand
Subway         .2729          .3465
Nik           -.4076         -.1689


Construct an interaction plot, and 
describe what you see. 

b. With A = spokesperson and B = lan- 
guage, sums of squares include SSA = 
4.2905, SSB = 0.2926, SSAB = 0.0818, 
and SSE = 36.5930. Perform a complete 
two-factor ANOVA, testing each of the 
three possible effects hypotheses at the 
.05 level. Explain what you discover. 

c. Researchers also measured the “purchase intention” of each subject, with
higher numbers indicating a greater likelihood of buying a Subway sandwich.
Use the information below to create an interaction plot and perform a
two-factor ANOVA as in parts (a)-(b). Again, explain your findings.

Purchase intention   Full English   Texting shorthand
Subway               3.949          5.417
Nik                  4.389          4.056

SSA = 2.545    SSB = 3.865
SSAB = 9.731   SSE = 112.409


63. The accompanying data resulted from an experiment to investigate whether yield
from a chemical process depended either on the formulation of a particular input or on
mixer speed.

                          Speed
                    60       70       80
                   189.7    185.1    189.0
              1    188.6    179.4    193.0
Formulation        190.1    177.3    191.1
                   165.1    161.7    163.3
              2    165.9    159.8    166.6
                   167.6    161.6    170.3

A statistical computer package gave SS(Form) = 2253.44, SS(Speed) = 230.81,
SS(Form*Speed) = 18.58, and SSE = 71.87.

a. Does there appear to be interaction between the factors?

b. Does yield appear to depend on either formulation or speed?

c. Calculate estimates of the main effects.

d. Verify that the residuals are 0.23, −0.87, 0.63, 4.50, −1.20, −3.30, −2.03, 1.97,
0.07, −1.10, −0.30, 1.40, 0.67, −1.23, 0.57, −3.43, −0.13, 3.57.

e. Construct a normal plot from the residuals given in part (d). Do the $\varepsilon_{ijk}$’s
appear to be normally distributed?

f. Plot the residuals against the predicted values (cell means) to see if the population
variance appears reasonably constant.

64. Artificial human joints are usually secured
with acrylic bone cement. The following 
data on the force (in Newtons) required to 
break an acrylic cement bond under differ- 
ent temperatures and media is consistent 
with the information in the article “Vali- 
dation of Small-Punch Test as a Technique 
for Characterizing the Mechanical Proper- 
ties of Acrylic Bone Cement” (J. Engr. 
Med. 2006: 11-21): 
Temp. (°C)   Medium   Breaking force data (N)
22           Dry      100.8, 141.9, 194.8, 118.4, 176.1, 213.1
37           Dry      302.1, 339.2, 288.8, 306.8, 305.2, 327.5
22           Wet      385.3, 368.3, 322.6, 307.4, 357.9, 321.4
37           Wet      363.5, 377.7, 327.7, 331.9, 338.1, 394.6


a. Identify the factors, levels, and treat- 
ments in this experiment. 

b. Create an interaction plot using the 
breaking force data, and comment on 
what you see. 

c. Test for the presence of main and 
interaction effects at the $\alpha = .05$ level. 
Are the results consistent with the 
interaction plot? 


65. Students Philip Hurst and Yuan Chao Jiang
investigated the accuracy of three different 
brands of .22 caliber ammunition at two 
different distances. The brands were Rem- 
ington, Winchester, and Federal, and the 
two designated distances were 25 and 50 
yards. “Accuracy” was measured by dis- 
tance from the bull’s-eye, in centimeters. 
All bullets were shot from the same rifle, 
and the order of the bullets was random- 
ized. The accompanying table shows the 
mean accuracy at each combination based 
on K = 75 bullets. 


                       Bullet brand
                       Fed.     Rem.     Win.
Firing distance   25   3.027    3.360    3.280
                  50   5.440    5.413    5.560


a. Identify the factors, levels, and treat- 
ments in this experiment. 

b. Create an interaction plot for the accuracy 
data, and comment on what you see. 

c. From software, SS(Dist.) = 568.97, SS(Brand) = 2.97, SS(Dist.*Brand) = 2.48,
and SSE = 1041.49. Test for the presence of main and interaction effects
for the accuracy data at the $\alpha = .05$ level. Are the results consistent with your
interaction plot?


66. In a study reported in the article “Can ‘Low-Fat’ Nutrition Labels Lead to Obesity?”
(J. Market. Res. 2006: 605-617), students and parents attending a university
open house were offered one of two candy bowls: one labeled “New Colors of Regular
M&M’s” or another labeled “New ‘Low-Fat’ M&M’s.” (The latter product does not
really exist; the candies were just regular M&M’s.) The researchers asked each person
to fill out a questionnaire (including height and weight information) and recorded
how much candy s/he took. The two
factors of interest are A = how the candy 
was labeled (regular, low-fat) and B = the 
person’s weight status (defined as “normal 
weight” for a body mass index below 25, 
“overweight” otherwise). The response 
variable used for the analysis was the 
amount of calories in the M&M’s taken by 
the subject. 


a. The accompanying table shows the 
average calorie consumption for each 
“treatment.” Construct an interaction 
plot, and describe what you discover. 


                        Subject’s weight
                        Normal   Overweight
Food label   Regular    189      192
             Low-fat    219      281


b. The article includes the following 
F-values and P-values for the various 
effects (error and total df do not follow 
our formulas because the study design 
was not balanced, but this does not 
affect the interpretation of the results). 
Are the results of the F tests consistent 
with the interaction plot? Explain. 
Source of variation   df    F      P-value
Food label             1   13.1    0.000
Weight                 1    4.3    0.039
Interaction            1    3.9    0.049
Error                251
Total                254

67. A study was carried out to compare the writing lifetimes of four premium brands of
pens. It was thought that the writing surface might affect lifetime, so three different
surfaces were randomly selected. A writing machine was used to ensure that conditions
were otherwise homogeneous (e.g., constant pressure and a fixed angle). The
accompanying table shows the two lifetimes (min) obtained for each brand-surface
combination.

                        Writing surface
Brand of pen    1            2            3
1               709, 659     713, 726     660, 645
2               668, 685     722, 740     692, 720
3               659, 685     666, 684     678, 750
4               698, 650     704, 666     686, 733

Carry out an appropriate ANOVA, and state your conclusions.

68. A 2005 article in Issues in Accounting
Education described an experiment in 
which students in an introductory account- 
ing course were randomly assigned to one 
of two computer-based learning 
(CBL) methods: one based on problem- 
solving and another using worked exam- 
ples. Students in each group were also 
classified according to prior accounting 
knowledge (yes or no). A 15-point diag- 
nostic exam was then administered to all 
students; average scores for the four groups, 
as well as a partial F table, appear below. 


                     Prior accounting knowledge?
CBL method           Yes       No
Problem-solving      10.45     7.59
Worked examples      10.17     8.32

Source of variation   df   F       P-value
CBL Method             1    0.50   .291
Prior Knowledge        1   33.13   .000
Interaction            1    1.60   .105
Error                 89
Total                 92

a. Create an interaction plot, and comment on what you see.

b. State the formal hypotheses being tested in the ANOVA table (there are three
sets of hypotheses), and test each at the $\alpha = .05$ level. Assume the conditions
required for this inference procedure are met.

c. Explain in practical terms what the tests in part (b) say about the effects of
different computer-based learning methods and/or prior accounting knowledge
on accounting diagnostic exam performance.

69. Several factors can impact the structural
soundness of 3D-printed objects, including 
the “struts” that connect various pieces. The 
following data appears in the article “Ana- 
lyzing the Effects of Temperature, Nozzle- 
Bed Distance, and Their Interactions on the 
Width of Fused Deposition Modeled Struts 
Using Statistical Techniques Toward Pre- 
cision Scaffold Fabrication’(J. Manuf. Sci. 
Engr. 2017). Response values are strut 
widths, in microns. 


Nozzle-bed distance 


0.2 mm 0.3 mm 0.4 mm 


180°C 845 850 885 600 605 625 495 495 525 
200 °C 770 800 850 650 690 690 490 520 525 
220°C 900 910 995 630 645 655 510 545 560 


Perform a complete two-factor ANOVA,
and report your findings.

70. The article “Is It Really Good to Talk?
Testing the Impact of Providing Concurrent
Verbal Protocols on Driving Performance”
(Ergonomics 2017: 770-779) reported an
experiment in which 20 drivers drove four
laps on a fixed course. During two of the
laps, drivers remained silent; on the other
two, drivers were instructed to “think
aloud” about their driving as they proceeded
along the course. Lap order was
randomized for each driver, and each driver’s
average speed throughout the lap was
recorded.

With A = protocol (silent or thinking aloud)
and B = driver, sums of squares consistent
with information in the article include SSA
= 6.272, SSB = 343.975, SSAB = 46.733,
and SSE = 138.571. Protocol is a fixed
factor, while driver is a random factor.
Perform the appropriate two-factor
ANOVA (complete with ANOVA table),
testing each effect at the .05 significance
level. Explain what each F test tells you.


71. a. Show that $E(\bar{X}_{i..} - \bar{X}_{...}) = \alpha_i$, so that
$\bar{X}_{i..} - \bar{X}_{...}$ is an unbiased estimator for $\alpha_i$
in the fixed effects model.

b. With $\hat{\gamma}_{ij} = \bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X}_{...}$, show
that $\hat{\gamma}_{ij}$ is an unbiased estimator for $\gamma_{ij}$ in
the fixed effects model.


72. Refer back to the previous exercise. Show
how a $100(1 - \alpha)\%$ t CI for $\alpha_i - \alpha_{i'}$ can be
obtained. Then compute a 95% interval for
$\alpha_2 - \alpha_3$ using the data from Example 11.18.
[Hint: With $\theta = \alpha_2 - \alpha_3$, the result of the
previous exercise indicates how to obtain $\hat{\theta}$.
Then compute $V(\hat{\theta})$ and $\sigma_{\hat{\theta}}$, and obtain an
estimate of $\sigma_{\hat{\theta}}$ by using $\sqrt{\text{MSE}}$ to estimate
$\sigma$, which identifies the appropriate number
of df.]

73. When both factors are random in a two-way
ANOVA experiment with K replications
per combination of factor levels, the
expected mean squares are $E(\text{MSE}) = \sigma^2$,
$E(\text{MSA}) = \sigma^2 + K\sigma_G^2 + JK\sigma_A^2$,
$E(\text{MSB}) = \sigma^2 + K\sigma_G^2 + IK\sigma_B^2$, and
$E(\text{MSAB}) = \sigma^2 + K\sigma_G^2$.

a. What F ratio is appropriate for testing
$H_{0G}: \sigma_G^2 = 0$ versus $H_{aG}: \sigma_G^2 > 0$?

b. What F ratio is appropriate for testing
$H_{0A}: \sigma_A^2 = 0$ versus $H_{aA}: \sigma_A^2 > 0$? Testing
$H_{0B}: \sigma_B^2 = 0$ versus $H_{aB}: \sigma_B^2 > 0$?


Supplementary Exercises: (74-84) 


74. Consider the following summary data on
the modulus of elasticity (×10⁶ psi) for
lumber of three different grades (in close
agreement with values in the article
“Bending Strength and Stiffness of Second-Growth
Douglas-Fir Dimension Lumber”
(For. Prod. J. 1991: 35-43), except that the
sample sizes there were larger):


Grade   J    $\bar{x}_{i.}$   $s_i$
1       10   1.63   .27
2       10   1.56   .24
3       10   1.42   .26


Use this data and a significance level of .01
to test the null hypothesis of no difference
in mean modulus of elasticity for the three
grades.

75. The article “The Effects of a Pneumatic
Stool and a One-Legged Stool on Lower
Limb Joint Load and Muscular Activity
During Sitting and Rising” (Ergonomics
1993: 519-535) gives the accompanying
data on the effort required of a subject to
arise from four different types of stools
(Borg scale). Perform an analysis of variance
using $\alpha = .05$, and follow this with a
multiple comparisons analysis if
appropriate.

Type of    Subject
stool      1    2    3    4    5    6    7    8    9    $\bar{x}_{i.}$
1          12   10   7    7    8    9    8    7    9    8.56
2          15   14   14   11   11   11   12   11   13   12.44
3          12   13   13   10   8    11   12   8    10   10.78
4          10   12   9    9    7    10   11   7    8    9.22
76. The article “Antimicrobial Activities of
Essential Oil of Eight Plant Species from
Different Families Against Some Pathogenic
Microorganisms” (Res. J. Microbiol.
2016: 28-34) reported on an experiment in
which various concentrations of eight
essential oils were applied to active bacterial
cultures. The accompanying table
shows the inhibition percentage of E. coli
for each oil-concentration combination.


              Concentration (µL/mL)
Oil Type      2     4     6     8     10
Ginger        60    72    80    92    99
Thyme         58    64    78    86    96
Coriander     57    63    81    89    99
Marjoram      17    34    49    51    67
Mustard       14    31    45    63    70
Chamomile     22    36    54    63    72
Licorice      10    14    23    28    33
Nigella       15    29    42    48    57


a. Perform a two-factor ANOVA, testing
both main effects at the .01 level.

b. Apply Tukey’s method to the eight
essential oils, and describe what you find.


77. An experiment was carried out to compare
flow rates for four different types of nozzle.

a. Sample sizes were 5, 6, 7, and 6,
respectively, and calculations gave
f = 3.68. State and test the relevant
hypotheses using $\alpha = .01$.

b. Analysis of the data using a statistical
computer package yielded P-value =
.029. At level .01, what would you
conclude, and why?


78. The article “Towards Improving the Properties
of Plaster Moulds and Castings”
(J. Engr. Manuf. 1991: 265-269) describes
several ANOVAs carried out to study
how the amount of carbon fiber and
sand additions affect various characteristics
of the molding process. Here we give data
on casting hardness and on wet-mold
strength.


Sand           Carbon fiber    Casting     Wet-mold
addition (%)   addition (%)    hardness    strength
0              0               61.0        34.0
0              0               63.0        16.0
15             0               67.0        36.0
15             0               69.0        19.0
30             0               65.0        28.0
30             0               74.0        17.0
0              .25             69.0        49.0
0              .25             69.0        48.0
15             .25             69.0        43.0
15             .25             74.0        29.0
30             .25             74.0        31.0
30             .25             72.0        24.0
0              .50             67.0        55.0
0              .50             69.0        60.0
15             .50             69.0        45.0
15             .50             74.0        43.0
30             .50             74.0        22.0
30             .50             74.0        48.0


a. An ANOVA for wet-mold strength
gives SS(Sand) = 705, SS(Fiber) =
1278, SSE = 843, and SST = 3105.
Test for the presence of any effects
using $\alpha = .05$.

b. Carry out an ANOVA on the casting
hardness observations using $\alpha = .05$.

c. Construct an interaction plot with sand 
percentage on the horizontal axis, and 
discuss the results of part (b) in terms of 
what the plot shows. 


79. The article “The Effectiveness of Virtual
and Augmented Reality in Health Science
and Medical Anatomy” (Anatom. Sc. Educ. 
2017: 549-559) describes an experiment in 
which 59 health science students received 
an identical, electronic 10-minute lesson on 
skull anatomy (complete with 3D graphics 
models) through one of three devices: a 
virtual reality (VR) system, an augmented 
reality (AR) system, or a 3D-capable tablet. 
After the lesson, all students took a 
20-question test on skull anatomy. The 
accompanying table summarizes the stu- 
dents’ exam scores. 


Lesson delivery   n    Mean   SD
VR                20   12.9   4.3
AR                17   12.5   4.5
3D Tablet         22   13.3   4.2


Does the data suggest that there is a difference
among the three lesson delivery
methods with respect to true mean exam
score? Use $\alpha = .05$.

80. Numerous factors contribute to the smooth
running of an electric motor (“Increasing 
Market Share Through Improved Product 
and Process Design: An Experimental 
Approach,” Qual. Engr. 1991: 361-369). 
In particular, it is desirable to keep motor 
noise and vibration to a minimum. To study 
the effect that the brand of bearing has on 
motor vibration, five different motor bear- 
ing brands were examined by installing 
each type of bearing on different random 
samples of six motors. The amount of 
motor vibration (measured in microns) was 
recorded when each of the 30 motors was 
running. The data for this study follows. 
State and test the relevant hypotheses at 
significance level .05, and then carry out a
multiple comparisons analysis if
appropriate.

                                                        Mean
Brand 1:   13.1   15.0   14.0   14.4   14.0   11.6     13.68
Brand 2:   16.3   15.7   17.2   14.9   14.4   17.2     15.95
Brand 3:   13.7   13.9   12.4   13.8   14.9   13.3     13.67
Brand 4:   15.7   13.7   14.4   16.0   13.9   14.7     14.73
Brand 5:   13.5   13.4   13.2   12.7   13.4   12.3     13.08


81. An article in the British scientific journal
Nature reported on an experiment in which
each of five groups consisting of six rats
was put on a diet with a different carbohydrate.
At the conclusion of the experiment,
the DNA content of the liver of each
rat was determined (mg/g liver), with the
following results:


Carbohydrate   $\bar{x}_{i.}$
Starch         2.58
Sucrose        2.63
Fructose       2.13
Glucose        2.41
Maltose        2.49


a. Assuming also that SST = 3.62, is the
true average DNA content affected by
the type of carbohydrate in the diet?
Construct an ANOVA table and use a
.05 level of significance.

b. Construct a t CI for the contrast

$$\theta = \mu_1 - (\mu_2 + \mu_3 + \mu_4 + \mu_5)/4$$

which measures the difference between
the average DNA content for the starch
diet and the combined average for the
four other diets. Does the resulting
interval include zero?

c. What is $\beta$ for the test when true average
DNA content is identical for three of the
diets and falls below this common value
by 1 standard deviation ($\sigma$) for the other
two diets?


82. Four laboratories (1-4) are randomly
selected from a large population, and each
is asked to make three determinations of the
percentage of methyl alcohol in specimens
of a compound taken from a single batch.
Based on the accompanying data, are
differences among laboratories a source of
variation in the percentage of methyl alcohol?
State and test the relevant hypotheses
using significance level .05.


1:   85.06   85.25   84.87
2:   84.99   84.28   84.88
3:   84.48   84.72   85.10
4:   84.10   84.55   84.05


83. The article “Effects of Pulmonary Rehabilitation
on Exercise Capacity and Disease
Impact in Patients with Chronic Obstructive 
Pulmonary Disease and Obesity” (Physio- 
therapy 2018: 248-250) reports a study in 
which 155 COPD sufferers completed an 
eight-week pulmonary rehabilitation pro- 
gram at St. James’ Hospital in Dublin, 
Ireland. Before and after the program, 
subjects performed the Six-Minute Walk 
Test (6MWT), which simply measures the 
distance (in meters) patients can walk in six 
minutes. The accompanying table summa- 
rizes the increase in 6MWT distance for the 
participants (i.e., post-rehab distance minus 
pre-rehab distance), separated by their 
weight category. 


Weight category        n    Mean   SD
Underweight/normal     53   61     80
Overweight             39   67     86
Obese                  63   41     87


a. Does the data suggest that the pul- 
monary rehab program is not equally 
effective for COPD patients of all weight 
categories? State and test the relevant 
hypotheses at significance level .05. 

b. Investigate differences between weight 
categories with respect to mean increase 
in 6MWT distance. 


84. Recall from Section 11.2 that if $c_1, c_2, \ldots, c_I$
are numbers satisfying $\sum c_i = 0$,
then $\theta = \sum c_i \mu_i$ is called a contrast in the
$\mu_i$’s. Notice that with $c_1 = 1$, $c_2 = -1$,
$c_3 = \cdots = c_I = 0$, $\sum c_i \mu_i = \mu_1 - \mu_2$, which
implies that every pairwise difference
between $\mu_i$’s is a contrast (and so is, e.g.,
$\mu_1 - .5\mu_2 - .5\mu_3$). A method attributed to
Scheffé gives simultaneous CIs with simultaneous
confidence level $100(1 - \alpha)\%$ for all
possible contrasts (an infinite number of
them!). The interval for $\sum c_i \mu_i$ is

$$\sum c_i \bar{x}_{i.} \pm \sqrt{(I - 1)\, F_{\alpha, I-1, n-I} \cdot \text{MSE} \cdot \sum c_i^2 / J_i}$$

Using the data from the previous exercise,
calculate the 95% confidence Scheffé
intervals for the contrasts $\mu_1 - \mu_2$, $\mu_1 - \mu_3$,
$\mu_2 - \mu_3$, and $.5\mu_1 + .5\mu_2 - \mu_3$ (the last contrast
compares obese patients to the average
of normal and overweight). Which contrasts
differ significantly from 0, and why?




Introduction 

The general objective of a regression analysis is to investigate the relationship between two (or more) 
variables so that we can gain information about one of them through knowing values of the other(s). 
Much of mathematics is devoted to studying variables that are deterministically related, meaning that 
once we are told the value of x, the value of y is completely specified. For example, suppose we 
decide to rent a van for a day and that the rental cost is $25.00 plus $.30 per mile driven. Letting 
x = the number of miles driven and y = the rental charge, then y = 25 + .3x. If the van is driven 100 
miles (x = 100), then y = 25 + .3(100) = $55. As another example, suppose the initial velocity of a 
particle is v₀ and it undergoes constant acceleration a. Then distance traveled = y = v₀x + ½ax², where 
x = time. 

There are many variables x and y that would appear to be related to each other, but not in a 
deterministic fashion. A familiar example to many students is given by variables x = high school 
grade point average (GPA) and y = college GPA. The value of y cannot be determined completely 
from knowledge of x, as two different students could have the same x value but very different 
y values. Yet there is a tendency for those students who have high (low) high school GPAs also to 
have high (low) college GPAs. Knowledge of a student’s high school GPA should help us predict 
how that person will do in college. Other examples of variables related in a nondeterministic fashion 
include x = applied tensile force and y = amount of elongation in a metal strip, x = age of a child and 
y = size of that child’s vocabulary, and x = size of an engine and y = fuel efficiency for an auto- 
mobile equipped with that engine. 

In this chapter, we generalize a deterministic linear relationship to obtain a probabilistic linear 
model for relating two variables x and y. We then develop procedures for making inferences based on 
data obtained from the model and obtain a quantitative measure (the correlation coefficient) of the 
extent to which the two variables are related. Techniques for assessing the adequacy of any particular 
regression model are then considered. Multiple regression analysis is introduced next as a way of 
relating y to two or more variables—for example, relating fuel efficiency of an automobile to weight, 
engine size, number of cylinders, and transmission type. The penultimate section of the chapter shows 
how matrix algebra techniques can be used to facilitate a concise and elegant development of 
regression procedures. The final section explains logistic regression, a method devised to predict a 
categorical variable y (e.g., absence or presence of lung cancer) from one or more x variables (amount 
of nicotine smoked/vaped per day, age, and so on). 
12.1 The Simple Linear Regression Model 


The key idea in developing a probabilistic relationship between a response (or dependent) variable 
y and an explanatory (or predictor or independent) variable x is to realize that once the value of 
x has been fixed, there is still uncertainty in what the resulting y value will be. That is, for a fixed 
value of x, we think of the response variable as being random. This random variable will be denoted 
by Y and its observed value by y. For example, suppose an investigator plans a study to relate 
y = yearly energy usage of an industrial building (1000’s of BTUs) to x = the shell area of the 
building (ft²). If one of the buildings selected for the study has a shell area of 25,000 ft², the resulting 
energy usage might be 2,215,000 or 2,348,000 or any one of a number of other possibilities. Since we 
don’t know a priori what the value of energy usage will be—because usage is determined partly by 
factors other than shell area—usage is regarded as a random variable Y. 
We typically relate the explanatory and response variables by an additive model equation: 


Y = (some particular deterministic function of x) + (a random deviation)
  = $f(x) + \varepsilon$                                                  (12.1)

The symbol $\varepsilon$ represents a random deviation or random “error” (i.e., a random variable), which is 
assumed to have mean value 0. This rv incorporates all variation in the response variable due to 
factors other than x. Without the random deviation $\varepsilon$, whenever x is fixed prior to making an 
observation on the response variable, the resulting (x, y) point would fall exactly on the graph of 
y = f(x), i.e., y would be entirely determined by x. The role of the random deviation $\varepsilon$ is to allow a 
nondeterministic relationship. The assumption that $\varepsilon$ has mean value 0 implies that, at any fixed 
x value, the mean (or expected) Y value is given by the function f(x). In other words, we regard f(x) in 
(12.1) as the mean response for a given x value. 

How should the deterministic part of the model equation be selected? Occasionally some sort of 
theoretical argument will suggest an appropriate choice of f(x). However, in practice the specification 
of f(x) is almost always made by obtaining sample data consisting of (x, y) pairs. A picture of the 
resulting observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, called a scatterplot, is then constructed. In this 
scatterplot each $(x_i, y_i)$ is represented as a point in a two-dimensional coordinate system. The pattern 
of points in the plot should suggest an appropriate f(x). 


Example 12.1 Troops deployed in active conflict areas worldwide depend on their body armor for 
protection. In conjunction with the US Army, the National Research Council developed the 2012 
report “Testing of Body Armor Materials—Phase III.” In one test, specimens of UHMWPE body 
armor were shot with a 7.62 mm round at different firing velocities. The accompanying data on 
x = velocity (m/s) and y = penetration area (mm², a proxy for amount of damage) appears in a graph 
in the report. 


i       1      2      3      4      5      6      7      8      9      10
$x_i$   670    675    679    681    694    699    699    708    726    732
$y_i$   66.4   64.5   63.6   72.9   79.1   76.7   65.5   68.0   57.8   72.4

i       11     12     13     14     15     16     17     18     19     20
$x_i$   738    740    762    762    768    780    792    786    790    787
$y_i$   78.6   87.9   92.6   83.0   79.0   75.3   83.4   100.7  106.6  112.8
Thus $(x_1, y_1) = (670, 66.4)$, $(x_2, y_2) = (675, 64.5)$, and so on. A scatterplot is shown in Figure 12.1. 
Here are some things to notice about the data and plot: 


• Several observations have identical x values yet different y values (e.g., $x_6 = x_7 = 699$, but 
$y_6 = 76.7$ and $y_7 = 65.5$). Thus x and y are not deterministically related. 

• There is a strong tendency for y to increase as x increases. That is, higher firing velocities tend, not 
surprisingly, to be associated with larger penetration areas, a positive relationship between the 
variables. 

• It appears that the value of y could be predicted from x by finding a line that “cuts through the 
heart” of the points in the plot; in fact, the authors of the report superimposed such a line on their 
plot. In other words, there is evidence of a substantial, though certainly not perfect, linear 
relationship between the two variables. 
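
The first two observations in the list above can be checked directly from the data. The following is a minimal sketch in Python (standard library only); the helper names `repeated_x` and `sample_correlation` are introduced here purely for illustration, and the sample correlation coefficient it computes is only defined later in the chapter.

```python
from collections import defaultdict
from math import sqrt

# Data from Example 12.1: x = velocity (m/s), y = penetration area (mm^2)
x = [670, 675, 679, 681, 694, 699, 699, 708, 726, 732,
     738, 740, 762, 762, 768, 780, 792, 786, 790, 787]
y = [66.4, 64.5, 63.6, 72.9, 79.1, 76.7, 65.5, 68.0, 57.8, 72.4,
     78.6, 87.9, 92.6, 83.0, 79.0, 75.3, 83.4, 100.7, 106.6, 112.8]

def repeated_x(xs, ys):
    """Return {x: [y values]} for each x observed with more than one distinct y."""
    groups = defaultdict(list)
    for xi, yi in zip(xs, ys):
        groups[xi].append(yi)
    return {xi: vals for xi, vals in groups.items() if len(set(vals)) > 1}

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient of the (x, y) pairs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

dups = repeated_x(x, y)          # identical x, different y: not deterministic
r = sample_correlation(x, y)     # positive value: y tends to increase with x
print(dups)
print(round(r, 3))
```

A nonempty `dups` dictionary confirms that x does not determine y, and a positive `r` is one way to quantify the increasing tendency described in the second bullet.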


Figure 12.1 Scatterplot for the data from Example 12.1 (vertical axis: penetration area in mm²; horizontal axis: velocity in m/sec) 


Notice that the axes in Figure 12.1 do not meet at (0, 0); rather, the lower-left corner is roughly at 
(660, 50). In most data sets, the values of x and/or y differ considerably from zero, and it makes better 
visual sense to adjust the axis boundaries to reflect the ranges of the variables. 


Example 12.2 As demand for renewable energy such as solar and wind power increases, companies 
are spending more research money to develop more efficient methods for producing such energy. The 
scatterplot in Figure 12.2 shows the efficiency of a solar cell (y, measured as a percentage of the 
theoretical maximum efficiency) and the “sheet resistance” of the cell (x, measured in ohms) for a 
random sample of 132 prototype solar cells manufactured by a certain energy company. (Data 
provided by John Coleman; efficiency in the 8-15% range may seem low, but these were typical 
values in the solar energy industry at the time the data was collected.) 
Figure 12.2 Scatterplot for the data from Example 12.2 (vertical axis: efficiency; horizontal axis: sheet resistance in ohms) 


As in the previous example, Figure 12.2 suggests a probabilistic relationship between the two 
variables: it appears that two solar cells with the same sheet resistance will not necessarily have the 
same efficiency. But the curvature in the scatterplot implies that a nonlinear relationship exists 
between x and y. A quadratic function f(x) would be more appropriate here if we wished to apply the 
model equation (12.1) to this scenario. 


Throughout the next several sections, we will concentrate on situations for which a linear rela- 
tionship, such as in Example 12.1, is reasonable. Quadratic and other more sophisticated models to 
accommodate data such as Example 12.2 are considered in Section 12.8. 


A Linear Probabilistic Model 

For the deterministic linear relationship $y = \beta_0 + \beta_1 x$, the slope coefficient $\beta_1$ is the guaranteed 
increase in y when x increases by one unit, and the intercept coefficient $\beta_0$ is the value of y when 
x = 0. When a scatterplot of bivariate data consisting of $(x_i, y_i)$ pairs shows a reasonably substantial 
linear pattern, it is natural to specify f(x) in the model equation (12.1) to be a linear function. Rather 
than assuming that the response variable itself is a linear function of x, the model assumes that the 
expected value of Y is a linear function of x. For each data point, the observed value of Y will deviate 
by a random amount from its expected value. 


THE SIMPLE LINEAR REGRESSION MODEL

There are parameters $\beta_0$, $\beta_1$, and $\sigma$ such that for any fixed value of the 
explanatory variable x, the response variable is related to x through the 
model equation 

$$Y = \beta_0 + \beta_1 x + \varepsilon$$

Moreover, regardless of the fixed x value, the random variable $\varepsilon$ is 
assumed to follow a $N(0, \sigma)$ distribution. 


The term “simple” here refers to the use of a single explanatory variable; in Section 12.7, we will 
consider models with multiple x variables. The n observed pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ are 
regarded as having been generated independently of one another from the model equation: first fix 
$x = x_1$ and observe $Y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1$, then fix $x = x_2$ and observe $Y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2$, and so on. 
Assuming that the $\varepsilon_i$’s are independent of each other implies that the $Y_i$’s are also. 
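
The data-generating process just described can be imitated on a computer. The sketch below is only an illustration: the function name `simulate_slr`, the x settings, and the parameter values are arbitrary choices made here, not values taken from any data set in the text.

```python
import random

def simulate_slr(xs, beta0, beta1, sigma, seed=None):
    """Generate Y_i = beta0 + beta1 * x_i + eps_i with independent eps_i ~ N(0, sigma)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Assumed x settings and parameter values, chosen only for illustration.
xs = [10, 10, 15, 15, 20, 20, 25, 25, 30, 30]
ys = simulate_slr(xs, beta0=65.0, beta1=-1.2, sigma=8.0, seed=1)

for x, y in zip(xs, ys):
    line_height = 65.0 - 1.2 * x     # height of the true regression line at x
    print(f"x = {x:2d}   line = {line_height:5.1f}   observed y = {y:6.2f}")
```

Each x value is fixed first and only then is a random deviation added, so repeated x values generally produce different observed y values.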

Figure 12.3 gives an illustration of data resulting from the simple linear regression model. 


Figure 12.3 Points corresponding to observations from the simple linear regression model (points $(x_i, y_i)$ scattered about the true regression line $f(x) = \beta_0 + \beta_1 x$) 


The first two model parameters $\beta_0$ and $\beta_1$ are the coefficients of the population (or true) 
regression line $f(x) = \beta_0 + \beta_1 x$. The slope parameter $\beta_1$ is now interpreted as the change in the 
expectation of Y associated with a one-unit increase in x. As an example, if x = size of a house (sq. ft.), 
y = amount of natural gas used (therms) during a specified period, and $\beta_1 = .017$, then the change in 
expected gas usage associated with a one-sq-ft increase in house size is .017 therms. The standard 
deviation parameter $\sigma$ controls the inherent amount of variability in the data. When $\sigma$ is very close 
to 0, virtually all of the $(x_i, y_i)$ pairs in the sample should correspond to points quite close to the 
population regression line. But if $\sigma$ is relatively large, a number of points in the scatterplot are likely 
to fall far from the line. Roughly speaking, the magnitude of $\sigma$ is the size of a “typical” deviation from 
the population line. 

The following notation will help clarify implications of the model relationship. Let x* denote a 
particular value of the explanatory variable x, and 


$\mu_{Y|x^*} = E(Y \mid x^*)$ = the expected (i.e., mean) value of Y when x = x* 


$\sigma^2_{Y|x^*} = V(Y \mid x^*)$ = the variance of Y when x = x* 


For example, if x = applied stress (kg/mm²) and y = time to fracture (h), then $\mu_{Y|20}$ denotes the 
expected time to fracture when applied stress is 20 kg/mm². If we conceptualize an entire population 
of (x, y) pairs resulting from applying stress to specimens, then $\mu_{Y|20}$ is the average of all values of the 
response variable for which x = 20. The variance $\sigma^2_{Y|20}$ describes the spread in the distribution of all 
y values for which applied stress is 20. 

When the value x = x* is fixed, the only randomness on the right-hand side of the model equation 
is from the random deviation $\varepsilon$. Recalling that the mean value of a numerical constant is itself and that 
adding a constant does not affect variance, we have that 

$$\mu_{Y|x^*} = E(\beta_0 + \beta_1 x^* + \varepsilon) = \beta_0 + \beta_1 x^* + E(\varepsilon) = \beta_0 + \beta_1 x^*$$
$$\sigma^2_{Y|x^*} = V(\beta_0 + \beta_1 x^* + \varepsilon) = V(\varepsilon) = \sigma^2$$
The first sequence of equalities says that the mean value of Y when x = x* is the height of the 
population regression line above the value x*. That is, the population regression line is the line of 
mean Y values—the mean response is a linear function of the explanatory variable. The second 
sequence of equalities tells us that the amount of variability in the distribution of Y is the same at 
every x value—this “constant variance” assumption is part of the simple linear regression model. 
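
Both properties can be checked empirically by drawing many Y values at each of two fixed x* values. This is a rough simulation sketch, not a proof; the parameter values, the seed, and the helper name `sample_y` are assumptions made for the illustration.

```python
import random
from statistics import mean, stdev

# Assumed parameter values for illustration; any beta0, beta1, and sigma > 0 would do.
beta0, beta1, sigma = 65.0, -1.2, 8.0
rng = random.Random(2024)

def sample_y(x_star, n):
    """Draw n independent values of Y = beta0 + beta1 * x_star + eps, eps ~ N(0, sigma)."""
    return [beta0 + beta1 * x_star + rng.gauss(0.0, sigma) for _ in range(n)]

summary = {}
for x_star in (20, 25):
    ys = sample_y(x_star, 20_000)
    summary[x_star] = (mean(ys), stdev(ys))
    print(f"x* = {x_star}: line height = {beta0 + beta1 * x_star:.1f}, "
          f"sample mean = {summary[x_star][0]:.2f}, "
          f"sample sd = {summary[x_star][1]:.2f}")
```

The sample means should track the height of the line at each x*, while the sample standard deviations should be close to the same value of sigma at both x* values.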

The constant variance property implies that points should spread out about the population 
regression line to the same extent throughout the range of x values in the sample, rather than fanning 
out more as x increases or as x decreases. If x = age of a preschool child and Y = the child’s vocabulary 
size, data suggests that mean vocabulary size increases linearly with age. However, there is more 
variability in vocabulary size for four-year-olds than for two-year-olds, so there is not constant vari- 
ation in Y about the population line, and the simple linear regression model is therefore not appropriate. 
In Section 12.6, we will briefly discuss possible remedies to this assumption violation. 

Finally, the sum of a constant and a normally distributed variable is itself normally distributed, and 
the addition of the constant affects only the mean value and not the standard deviation. So for any 
fixed value x*, Y ($= \beta_0 + \beta_1 x^* + \varepsilon$) has a normal distribution. The foregoing properties are summarized 
in Figure 12.4. 


Figure 12.4 (a) Distribution of $\varepsilon$ (normal, mean 0, standard deviation $\sigma$), (b) distribution of Y for different values of x, centered on the line $E(Y|x) = \beta_0 + \beta_1 x$ 


Example 12.3 Suppose the relationship between applied stress x and time to fracture y is described by 
the simple linear regression model with $\beta_0 = 65$, $\beta_1 = -1.2$, and $\sigma = 8$. Then there is a 1.2-h decrease in 
average (or expected) fracture time associated with an increase of 1 kg/mm² in applied stress. For any 
fixed value x* of stress, time to fracture is normally distributed with mean value 65 − 1.2x* and 
standard deviation 8. Roughly speaking, in the population consisting of all (x, y) points, the magnitude of 
a typical deviation from the true regression line is about 8. 

For x = 20, Y has mean value $\mu_{Y|20} = 65 - 1.2(20) = 41$, so 

$$P(Y > 50 \text{ when } x = 20) = P\left(Z > \frac{50 - 41}{8}\right) = 1 - \Phi(1.13) = .1292$$

When applied stress is 25, $\mu_{Y|25} = 35$, so the probability that time to fracture exceeds 50 is 

$$P(Y > 50 \text{ when } x = 25) = P\left(Z > \frac{50 - 35}{8}\right) = 1 - \Phi(1.88) = .0301$$


These probabilities are illustrated as the shaded areas in Figure 12.5. 


Figure 12.5 Probabilities based on the simple linear regression model (true regression line E(Y|x) = 65 − 1.2x, with shaded areas P(Y > 50 when x = 20) = .1292 and P(Y > 50 when x = 25) = .0301) 
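
The two probabilities in Example 12.3 can be reproduced without a normal table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `prob_exceeds` is introduced here for illustration. Because the code does not round z to two decimal places, its answers may differ from the table-based values .1292 and .0301 in the third decimal place.

```python
from statistics import NormalDist

beta0, beta1, sigma = 65.0, -1.2, 8.0    # parameter values from Example 12.3

def prob_exceeds(threshold, x_star):
    """P(Y > threshold) when x = x_star under the simple linear regression model."""
    y_dist = NormalDist(mu=beta0 + beta1 * x_star, sigma=sigma)
    return 1.0 - y_dist.cdf(threshold)

p20 = prob_exceeds(50, 20)   # mean of Y is 41 at x = 20
p25 = prob_exceeds(50, 25)   # mean of Y is 35 at x = 25
print(round(p20, 4), round(p25, 4))
```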


Suppose that $Y_1$ denotes an observation on time to fracture made with x = 25 and $Y_2$ denotes an 
independent observation made with x = 24. Then the difference $Y_1 - Y_2$ is normally distributed with 
mean value $E(Y_1 - Y_2) = \beta_1 = -1.2$, variance $V(Y_1 - Y_2) = \sigma^2 + \sigma^2 = 128$, and standard deviation 
$\sqrt{128} = 11.314$. The probability that $Y_1$ exceeds $Y_2$ is 

$$P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0 - (-1.2)}{11.314}\right) = P(Z > .11) = .4562$$

That is, even though we expect Y to decrease when x increases by one unit, the probability is fairly 
high (but less than .5) that the observed Y at x + 1 will be larger than the observed Y at x. 
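
The same calculation for the difference of two independent observations can be sketched as follows, again with the standard-library `NormalDist`. The variable names are illustrative, and the result may differ slightly from .4562 because z is not rounded to .11.

```python
from math import sqrt
from statistics import NormalDist

beta1, sigma = -1.2, 8.0                 # values from Example 12.3

# Y1 is observed at x = 25 and Y2 at x = 24, independently of each other.
mean_diff = beta1 * (25 - 24)            # E(Y1 - Y2)
sd_diff = sqrt(sigma**2 + sigma**2)      # variances add for independent rvs

p_y1_larger = 1.0 - NormalDist(mean_diff, sd_diff).cdf(0.0)
print(round(sd_diff, 3), round(p_y1_larger, 4))
```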


Our discussion thus far has presumed that the explanatory variable is under the control of the 
investigator, so that only the response variable Y is random. This will not always be the case: if we 
take a random sample of college students and record the height and weight of each, neither variable is 
preselected, so both x and y could be considered random. Methods and conclusions of the next several 
sections can be applied both when the values of the explanatory variable are fixed in advance and 
when they are random, but because the derivations and interpretations are more straightforward in the 
former case, we will continue to work explicitly with it. For more commentary, see the excellent book 
by Michael Kutner et al. listed in the bibliography. 
Exercises: Section 12.1 (1-12) 


1. Obesity is associated with higher foot load
that can potentially increase pain and discomfort,
but little research has been done
on this relationship in children and its
possible effects. A graph in the article
“Childhood Obesity is Associated with
Altered Plantar Pressure Distribution During
Running” (Gait and Posture 2018:
202-205) gave the accompanying data on
x = body mass index (kg/m²) and y = peak
foot pressure (kPa) while running for a
sample of 42 children.


x:   12.8   13.0   13.0   13.5   13.8   13.8   14.2   14.4   14.5
y:   340    346    641    572    360    334    366    538    360
x:   14.6   14.6   14.8   14.9   15.0   15.0   15.0   15.5   15.6
y:   417    627    609    552    414    575    578    546    314
x:   15.9   16.6   16.9   17.0   17.1   17.1   17.5   17.6   18.6
y:   466    572    494    454    305    368    494    322    494
x:   18.7   18.7   20.0   20.1   20.5   20.6   21.0   21.1   21.2
y:   589    305    664    368    362    474    486    351    382
x:   21.6   22.4   23.1   24.2   24.7   26.5
y:   491    893    741    850    815    376


a. Construct stem-and-leaf displays of 
both BMI and peak foot pressure, and 
comment on interesting features. 

b. Is the value of peak foot pressure 
completely and uniquely determined by 
BMI? Explain your reasoning. 

c. Construct a scatterplot of the data. 
Does it appear that peak foot pressure 
could be predicted by a child’s body 
mass index? Explain your reasoning. 


2. Verapamil is used to treat certain heart 
conditions, including high blood pressure 
and arrhythmia. Studies continue on the 
factors that affect the drug’s absorption into 
the body. The article “The Effect of Non- 
ionic Surfactant Brij 35 on Solubility and 
Acid-Base Equilibria of Verapamil” 
(J. Chem. Engr. Data 2017: 1776-1781) 
includes the following data on x = pH and 
y = Verapamil solubility (10° mol/L) at 
25 °C for one such study. 


8.12 8.32 841 8.62 8.70 8.84 8.88 9.09 


22.2 146 13.9 8.76 5.06 5.57 
a. Construct a scatterplot of solubility 
versus pH, and describe what you see. 
Does it appear that a linear model 
would be appropriate here? 

b. Hydrogen ion concentration [H⁺] is 
related to pH by pH = −log₁₀([H⁺]). 
Use this to calculate the hydrogen ion 
concentrations for each observation, 
then make a scatterplot of solubility 
versus [H⁺]. Does it appear that a linear 
model would fit this data well? 

c. Would a linear function fit the data in 
part (b) perfectly? That is, is it rea- 
sonable to assume a completely deter- 
ministic relationship here? Explain 
your reasoning. 


3. Bivariate data often arises from the use of 


two different techniques to measure the 
same quantity. As an example, the accom- 
panying observations on x = hydrogen 
concentration (ppm) using a gas chro- 
matography method and y = concentration 
using a new sensor method were read from 
a graph in the article “A New Method to 
Measure the Diffusible Hydrogen Content 
in Steel Weldments Using a Polymer 
Electrolyte-Based Hydrogen Sensor” 
(Welding Res., July 1997: 251s—256s). 


47 62 65 70 70 78 95 100 114 118 
38 62 53 67 84 79 93 106 117 116 
124 127 140 140 140 150 152 164 198 221 
127 114 134 139 142 170 149 154 200 215


Construct a scatterplot. Does there appear 
to be a very strong relationship between the 
two types of concentration measurements? 
Do the two methods appear to be measuring 
roughly the same quantity? Explain your 
reasoning. 


4. A study to assess the capability of subsurface
flow wetland systems to remove biochemical
oxygen demand (BOD, a measure
of organic matter in sewage) and various
other chemical constituents resulted in the
accompanying data on x = BOD mass
loading (kg/ha/d) and y = BOD mass
removal (kg/ha/d) (“Subsurface Flow
Wetlands—A Performance Evaluation,”
Water Environ. Res. 1995: 244-247).


a. Construct boxplots of both mass load- 
ing and mass removal, and comment on 
any interesting features. 

b. Construct a scatterplot of the data, and 
comment on any interesting features. 


5. The article “Objective Measurement of the
Stretchability of Mozzarella Cheese”
(J. Texture Stud. 1992: 185-194) reported 
on an experiment to investigate how the 
behavior of mozzarella cheese varied with 
temperature. Consider the accompanying 
data on x = temperature and y = elongation 
(%) at failure of the cheese. [Note: The 
researchers were Italian and used real 
mozzarella cheese, not the poor cousin 
widely available in the USA.] 


59 63 68 72 74 78 83 


a. Construct a scatterplot in which the 
axes intersect at (0, 0). Mark 0, 20, 40, 
60, 80, and 100 on the horizontal axis 
and 0, 50, 100, 150, 200, and 250 on 
the vertical axis. 

b. Construct a scatterplot in which the 
axes intersect at (55, 100), as was done 
in the cited article. Does this plot seem 
preferable to the one in part (a)? 
Explain your reasoning. 

c. What do the plots of parts (a) and 
(b) suggest about the nature of the 
relationship between the two variables? 


6. One factor in the development of tennis
elbow, a malady that strikes fear in the
hearts of all serious tennis players, is the
impact-induced vibration of the racket-and-arm
system at ball contact. It is well known
that the likelihood of getting tennis elbow
depends on various properties of the racket
used. Consider the scatterplot of x = racket
resonance frequency (Hz) and y = sum of
peak-to-peak acceleration (a characteristic
of arm vibration, in m/s/s) for n = 23 different
rackets (“Transfer of Tennis Racket
Vibrations into the Human Forearm,” Med.
Sci. Sports Exercise 1992: 1134-1140).
Discuss interesting features of the data and
scatterplot.

[Scatterplot: horizontal axis x = racket resonance frequency (Hz), marked from 100 to 190; vertical axis y = sum of peak-to-peak acceleration (m/s/s)]


7. Data from the EPA’s Fuel Efficiency Guide
suggests an approximate linear relationship
between y = highway fuel efficiency
(mpg) and x = weight (lbs) for midsize
cars. Suppose the equation of the true
regression line is $f(x) = 70 - .0085x$.

a. What is the expected value of highway
fuel efficiency when weight = 2500
lbs?

b. By how much can we expect highway
fuel efficiency to change when weight
increases by 1 lb?

c. Answer part (b) for an increase of 500
lbs.

d. Answer part (b) for a decrease of 500
lbs.


8. Referring to the previous exercise, suppose
that the random deviation $\varepsilon$ is normally
distributed with standard deviation 4.6 mpg.

a. What is the probability that the
observed value of highway fuel efficiency
will exceed 30 mpg when the
car’s weight is 4000 lbs?

b. Repeat part (a) with 5000 in place of
4000.

c. Consider making two independent
observations on highway fuel efficiency,
the first for a car weighing 4000 lbs and
the second for x = 5000. What is the
probability that the first observation will
exceed the second by more than 5 mpg?

d. Let $Y_1$ and $Y_2$ denote observations on
highway fuel efficiency when $x = x_1$
and $x = x_2$, respectively. By how much
would $x_2$ have to exceed $x_1$ in order that
$P(Y_1 > Y_2) = .95$?

9. The flow rate y (m³/min) in a device used
for air-quality measurement depends on the
pressure drop x (in. of water) across the
device’s filter. Suppose that for x values
between 5 and 20, the two variables are
related according to the simple linear
regression model with true regression line
$E(Y|x) = -.12 + .095x$.

a. What is the expected change in flow 
rate associated with a 1-in. increase in 
pressure drop? Explain. 

b. What change in flow rate can be 
expected when pressure drop decreases 
by 5 in.? 

c. What is the expected flow rate for a 
pressure drop of 10 in.? A drop of 15 
in.? 

d. Suppose $\sigma = .025$ and consider a
pressure drop of 10 in. What is the
probability that the observed value of
flow rate will exceed .835? That the
observed flow rate will exceed .840?

e. What is the probability that an obser- 
vation on flow rate when pressure drop 
is 10 in. will exceed an observation on 
flow rate made when pressure drop is 
11 in.? 

10. Suppose the expected cost of a production
run is related to the size of the run by the
equation $E(Y|x) = 4000 + 10x$. Let Y denote
an observation on the cost of a run.
Assuming that the variables size and cost
are related according to the simple linear
regression model, could it be the case that
P(Y > 5500 when x = 100) = .05 and
P(Y > 6500 when x = 200) = .10? Explain.

11. Suppose that in a certain chemical process
the reaction time y (hr) is related to the
temperature (°F) in the chamber in which
the reaction takes place according to the
simple linear regression model with equation
$E(Y|x) = 5.00 - .01x$ and $\sigma = .075$.


a. What is the expected change in reaction 
time for a 1 °F increase in temperature?
For a 10 °F increase in temperature? 

b. What is the expected reaction time 
when temperature is 200 °F? When 
temperature is 250 °F? 

c. Suppose five observations are made 
independently on reaction time, each 
one for a temperature of 250 °F. What 
is the probability that all five times are 
between 2.4 and 2.6 h? 

d. What is the probability that two inde- 
pendently observed reaction times for 
temperatures 1° apart are such that the 
time at the higher temperature exceeds 
the time at the lower temperature? 


12. The article “On the Theoretical Velocity
Distribution and Flow Resistance in Natural
Channels” (J. Hydrol. 2017: 777-785)
suggests a quadratic relationship between
x = flow depth (m) and y = water surface
slope at certain points along the Tiber River
in Italy. Suppose the variables are related
by Equation (12.1) with $f(x) = -0.6x^2 + 5x + 1$
(similar to the equation suggested in
the article).

a. What is the expected water surface 
slope when the flow depth is 2.0 m? 
2.5 m? 3.0 m? 

b. Does the expected water surface slope 
change by a fixed amount for each 1-m 
increase in flow depth? Explain. 

c. Determine a flow depth for which the 
expected surface slope is the same as 
the expectation for x = 2.0 m obtained 
in part (a). [Note: Your answer should 
be something other than 2.0.]

d. For what depth is expected water surface
slope maximized?

e. Assume the rv $\varepsilon$ in Equation (12.1) has
a standard normal distribution; this is
consistent with information in the
article. At a flow depth of 3.0 m,
what’s the probability the water surface
slope Y is greater than 10? Less than 6?

We will assume in this and the next several sections that the variables x and y are related according to
the simple linear regression model. The values of the parameters $\beta_0$, $\beta_1$, and $\sigma$ will almost never be
known to an investigator. Instead, sample data consisting of n observed pairs $(x_1, y_1), \ldots, (x_n, y_n)$ will
be available, from which the model parameters and the true regression line itself can be estimated.
These observations are assumed to have been obtained independently of each other.

According to the model, the observed points will be distributed about the true regression line
$f(x) = \beta_0 + \beta_1x$ in a random manner. Figure 12.6 shows a scatterplot of observed pairs along with
two candidates for the estimated regression line, $y = a_0 + a_1x$ and $y = b_0 + b_1x$. Intuitively, the line
$y = a_0 + a_1x$ is not a reasonable estimate of the true line because, if $y = a_0 + a_1x$ were the true line,
the observed points would almost surely have been closer to this line. The line $y = b_0 + b_1x$ is a more
plausible estimate because the observed points are scattered rather closely about this line.


Figure 12.6 Two different estimates of the true regression line, $y = b_0 + b_1x$ and $y = a_0 + a_1x$: one good and one bad


Figure 12.6 and the foregoing discussion suggest that our estimate of $\beta_0 + \beta_1x$ should be a line that
provides, in some sense, a “best fit” to the observed data points. This is what motivates the principle
of least squares, which can be traced back to the mathematicians Gauss and Legendre around the
year 1800. According to this principle, a line provides a good fit to the data if the vertical distances or
deviations from the observed points to the line (see Figure 12.7) are small. The proposed measure of
the goodness-of-fit is the sum of the squares of these deviations; the best-fit line is then the one having
the smallest possible sum of squared deviations.

Figure 12.7 Deviations of observed data from line $y = b_0 + b_1x$ (horizontal axis: applied stress, kg/mm²; vertical axis: time to fracture, hr)


PRINCIPLE OF LEAST SQUARES

The vertical deviation of the point $(x_i, y_i)$ from a line $y = b_0 + b_1x$ is

height of point − height of line = $y_i - (b_0 + b_1x_i)$

The sum of squared vertical deviations from the points $(x_1, y_1), \ldots, (x_n, y_n)$ to the line is then

$$g(b_0, b_1) = \sum_{i=1}^{n} [y_i - (b_0 + b_1x_i)]^2$$

The point estimates of $\beta_0$ and $\beta_1$, denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$ and called the
least squares estimates, are those values that minimize $g(b_0, b_1)$. The
estimated regression line or least squares regression line (LSRL) is
then the line whose equation is $y = \hat{\beta}_0 + \hat{\beta}_1x$.


The minimizing values of $b_0$ and $b_1$ are found by taking partial derivatives of $g(b_0, b_1)$ with respect to
both $b_0$ and $b_1$, equating them both to zero, and solving the equations

$$\frac{\partial g(b_0, b_1)}{\partial b_0} = \sum 2(y_i - b_0 - b_1x_i)(-1) = 0$$

$$\frac{\partial g(b_0, b_1)}{\partial b_1} = \sum 2(y_i - b_0 - b_1x_i)(-x_i) = 0$$

Cancellation of the factor 2 and rearrangement gives the following system of equations, called the
normal equations:

$$nb_0 + \left(\sum x_i\right)b_1 = \sum y_i$$

$$\left(\sum x_i\right)b_0 + \left(\sum x_i^2\right)b_1 = \sum x_iy_i$$

The normal equations are linear in the two unknowns $b_0$ and $b_1$. Provided that at least two of the $x_i$’s
are different, the least squares estimates are the unique solution to this linear system.
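
As a sketch of how the normal equations can be solved by substitution (one route among several; the sample means $\bar{x} = \sum x_i/n$ and $\bar{y} = \sum y_i/n$ are notation assumed here), divide the first equation by n and substitute the result into the second:

```latex
\begin{align*}
b_0 &= \bar{y} - b_1\bar{x}
  && \text{(first equation divided by } n\text{)} \\
\Big(\sum x_i\Big)(\bar{y} - b_1\bar{x}) + \Big(\sum x_i^2\Big)b_1 &= \sum x_iy_i
  && \text{(substitute into the second equation)} \\
b_1\Big(\sum x_i^2 - n\bar{x}^2\Big) &= \sum x_iy_i - n\bar{x}\bar{y}
  && \text{(using } \textstyle\sum x_i = n\bar{x}\text{)} \\
b_1 &= \frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
     = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The denominator equals $\sum (x_i - \bar{x})^2$, which is positive exactly when at least two of the $x_i$’s differ; this is why that condition guarantees a unique solution.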


PROPOSITION The least squares estimate of the slope coefficient $\beta_1$ of the true regression line is

$$b_1 = \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{12.2}$$

The least squares estimate of the intercept $\beta_0$ of the true regression line is

$$b_0 = \hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1\bar{x} \tag{12.3}$$

Moreover, under the normality assumption of the simple linear regression model,
$\hat{\beta}_0$ and $\hat{\beta}_1$ are also the maximum likelihood estimates (see Exercise 23).


Because they will feature prominently here and in subsequent sections, we define the following
notation for certain sums:

$$S_{xx} = \sum (x_i - \bar{x})^2 \qquad S_{yy} = \sum (y_i - \bar{y})^2 \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$$

The $S_{xx}$ formula was presented in Chapter 1 in connection with the sample variance: $s^2 = S_{xx}/(n - 1)$,
and similarly for y. The least squares estimates of the regression coefficients can then be written as

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

We emphasize that before $\hat{\beta}_1$ and $\hat{\beta}_0$ are computed, a scatterplot should be examined to see whether a
linear probabilistic model is plausible. If the points do not tend to cluster about a straight line with
roughly the same degree of spread for all x (e.g., Figure 12.2), then other models should be investigated.


Example 12.4 As brick-and-mortar shops decline and online retailers like Amazon and Wayfair 
ascend, demand for warehouse storage space has steadily increased. Despite effectively being empty 
shells, warehouses still require professional appraisal. The following data on x = truss height (ft), 
which determines how high stored goods can be stacked, and y = sale price ($) per square foot 
appeared in the article “Challenges in Appraising ‘Simple’ Warehouse Properties” (The Appraisal J. 
2001: 174-178). 


Warehouse   1      2      3      4      5      6      7      8      9      10
x           12     14     14     15     15     16     18     22     22     24
y           35.53  37.82  36.90  40.00  38.00  37.50  41.00  48.50  47.00  47.50

Warehouse   11     12     13     14     15     16     17     18     19
x           24     26     26     27     28     30     30     33     36
y           46.20  50.35  49.13  48.07  50.90  54.78  54.32  57.17  57.45

From the sample data,

$$\bar{x} = \frac{1}{19}\sum x_i = 22.737 \text{ ft} \qquad \bar{y} = \frac{1}{19}\sum y_i = 46.217 \text{ \$/ft}^2$$

$$S_{xx} = \sum (x_i - 22.737)^2 = 913.684 \qquad S_{yy} = \sum (y_i - 46.217)^2 = 924.436$$

$$S_{xy} = \sum (x_i - 22.737)(y_i - 46.217) = 901.944$$

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{901.944}{913.684} = .987$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 46.217 - 0.987(22.737) = 23.8$$


The equation of the LSRL is y = 23.8 + .987x. We estimate that the change in expected sale price 
associated with a 1-ft increase in truss height is .987, or about 99 cents per square foot. The intercept 
of 23.8, while important for correctly summarizing the data, does not have a direct interpretation— 
after all, it doesn’t make sense for a warehouse to have a truss height of x = 0 feet (how would you 
store anything?). Figure 12.8, generated by the statistical software package R, shows that the least 
squares line provides an excellent summary of the relationship between the two variables. 


Figure 12.8 A scatterplot of the data in Example 12.4 with the LSRL superimposed, from R (horizontal axis: truss height; vertical axis: price per square foot)
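
The arithmetic of Example 12.4 can be reproduced in a few lines of code. The following is a minimal Python sketch, not the R code behind Figure 12.8; the function name `least_squares` is a hypothetical helper introduced here for illustration, and the data are the 19 warehouse observations listed in the example.

```python
# Truss height (ft) and sale price ($ per square foot) for the 19 warehouses
# of Example 12.4.
x = [12, 14, 14, 15, 15, 16, 18, 22, 22, 24,
     24, 26, 26, 27, 28, 30, 30, 33, 36]
y = [35.53, 37.82, 36.90, 40.00, 38.00, 37.50, 41.00, 48.50, 47.00, 47.50,
     46.20, 50.35, 49.13, 48.07, 50.90, 54.78, 54.32, 57.17, 57.45]


def least_squares(x, y):
    """Return (b0, b1): the least squares intercept and slope.

    Hypothetical helper: computes S_xx and S_xy from their definitions,
    then slope = S_xy / S_xx and intercept = y_bar - slope * x_bar.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        # All x values equal: the normal equations have no unique solution.
        raise ValueError("need at least two different x values")
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.5f}")
```

Carrying full precision through the calculation gives an intercept slightly different from one computed with the rounded slope .987, which is why intermediate rounding is best avoided in practice.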


The LSRL can immediately be used for two different purposes. For a fixed x value $x^*$, $\hat{\beta}_0 + \hat{\beta}_1x^*$
(the height of the line above $x^*$) gives both (1) a point estimate of the mean value of Y when $x = x^*$
and (2) a point prediction of the Y value that will result from a single new observation made at $x = x^*$.

The least squares line should not be used to make a prediction for an x value much beyond the 
range of the data, such as x = 5 or x = 45 in Example 12.4. The danger of extrapolation is that the 
fitted relationship (a line here) may not be valid for such x values. 


Example 12.5 (Example 12.4 continued) A point estimate for the true average price for all warehouses
with 25-ft truss height is

$$\hat{\mu}_{Y|25} = \hat{\beta}_0 + \hat{\beta}_1(25) = 23.8 + .987(25) = \$48.48/\text{ft}^2$$

This also represents a point prediction for the price of a single warehouse with 25-ft truss height.
Notice that although no sample observations had x = 25, this value lies in the “middle” of the set of
x values (see Figure 12.8). This is an example of interpolation: using the LSRL for x values that
were unseen but are consistent with the sample data.

A point estimate for the true average price for all warehouses with 50-ft truss height is

$$\hat{\mu}_{Y|50} = \hat{\beta}_0 + \hat{\beta}_1(50) = 23.8 + .987(50) = \$73.15/\text{ft}^2$$

However, because this calculation involves an extrapolation (the value x = 50 is well outside the
bounds of the available data), we have much less faith that this estimated price is accurate.
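
The distinction between interpolation and extrapolation can be built into a prediction routine. The sketch below is illustrative only: `predict_price` is a hypothetical function, it uses the rounded estimates 23.8 and .987 from Example 12.4, and it treats the observed truss-height range of 12 to 36 ft as the boundary beyond which a prediction is flagged.

```python
B0, B1 = 23.8, 0.987      # rounded least squares estimates from Example 12.4
X_MIN, X_MAX = 12, 36     # smallest and largest truss heights in the sample


def predict_price(x_star):
    """Return (estimate, is_extrapolation) for truss height x_star (ft).

    The estimate is the height of the fitted line above x_star; the flag is
    True when x_star lies outside the range of the sample x values.
    """
    estimate = B0 + B1 * x_star
    is_extrapolation = not (X_MIN <= x_star <= X_MAX)
    return estimate, is_extrapolation


for x_star in (25, 50):
    est, extrap = predict_price(x_star)
    note = "extrapolation, treat with caution" if extrap else "within the data range"
    print(f"x* = {x_star} ft: {est:.3f} $/sq ft ({note})")
```

Flagging rather than refusing the out-of-range request mirrors the text: the arithmetic is still possible at x = 50, but the fitted line may not describe the relationship there.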


Residuals and Estimating $\sigma$

The parameter $\sigma$ determines the amount of variability inherent in the regression model. A large value
of $\sigma$ will lead to observed $(x_i, y_i)$’s that are typically quite spread out about the true regression line,
whereas when $\sigma$ is small the observed points will tend to fall very close to the true line (see
Figure 12.9). An estimate of $\sigma$ will be used in confidence interval formulas and hypothesis-testing
procedures presented in the next two sections. Because the equation of the true line is unknown, the
estimate is based on the extent to which the sample observations deviate from the estimated line.


Figure 12.9 Typical sample for $\sigma$: (a) small (x = tensile force, y = elongation); (b) large (x = advertising expenditure, y = product sales)


DEFINITION The fitted (or predicted) values $\hat{y}_1, \ldots, \hat{y}_n$ are obtained by successively substituting
the x values $x_1, \ldots, x_n$ into the equation of the LSRL: the ith fitted value is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i = \bar{y} + \hat{\beta}_1(x_i - \bar{x})$$

The residuals $e_1, \ldots, e_n$ are the vertical deviations from the LSRL: the ith residual is

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i) = (y_i - \bar{y}) - \hat{\beta}_1(x_i - \bar{x}) \tag{12.4}$$

In words, the predicted value $\hat{y}_i$ is the value of y that we would predict or expect when using the
estimated regression line with $x = x_i$; $\hat{y}_i$ is the height of the estimated regression line above $x_i$. The
residual $e_i$ is the difference between the observed $y_i$ and the predicted $\hat{y}_i$.

Assuming that the line in Figure 12.7 is the least squares line, the residuals are identified by the
vertical line segments from the observed points to the line. In fact, the principle of least squares is
equivalent to determining the line for which the sum of squared residuals is minimized. If the
residuals are all small in magnitude, then much of the variability in observed y values appears to be
due to the linear relationship between x and y, whereas many large residuals suggest quite a bit of
inherent variability in y relative to the amount due to the linear relation. The residuals from the LSRL
always satisfy $\sum e_i = 0$ and so $\bar{e} = 0$ (see Exercise 24; in practice, the sum may deviate a bit from
zero due to rounding).

The ith residual $e_i$ may also be regarded as a proxy for the unobservable “true” error $\varepsilon_i$ for the ith
observation:

true error: $\varepsilon_i = y_i - (\beta_0 + \beta_1x_i)$

estimated error: $e_i = \hat{\varepsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i)$

The sum of squared residuals is used here to estimate the standard deviation $\sigma$ of the $\varepsilon_i$’s in the same
way that the sum of squares $S_{xx}$ was previously used to estimate a population sd.


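DEFINITION The error sum of squares (or residual sum of squares), denoted by SSE, is

$$\text{SSE} = \sum (e_i - \bar{e})^2 = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

and the least squares estimate of $\sigma^2$ is

$$\hat{\sigma}^2 = s_e^2 = \frac{\text{SSE}}{n - 2} = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}$$

The estimate $s_e = \sqrt{\text{SSE}/(n - 2)}$ of $\sigma$ is called the residual standard deviation.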


The divisor n − 2 in $s_e$ is the number of degrees of freedom (df) associated with the estimate (or,
equivalently, with the error sum of squares). This is because to obtain $s_e$, the two parameters $\beta_0$ and $\beta_1$
must first be estimated, which results in a loss of 2 df (just as $\mu$ had to be estimated in one-sample
problems, resulting in an estimated variance based on n − 1 df). Equivalently, the normal equations
impose two constraints; as a result, if n − 2 of the residuals are known, then the remaining two are
completely determined (so only n − 2 are freely determined; see Exercise 24).

Replacing each $y_i$ in the formula for $s_e$ by the rv $Y_i$ gives the estimator $S_e$. It can be shown that $S_e^2$ is
an unbiased estimator for $\sigma^2$, although the estimator $S_e$ is biased for $\sigma$. (The mle of $\sigma^2$ based on the
normal model has divisor n rather than n − 2, so it is biased.)

The interpretation of $s_e$ here is similar to that of $\sigma$ given earlier. Roughly speaking, $s_e$ is the size of
a “typical” or “representative” deviation from the least squares line.

Example 12.6 Japan’s high population density has resulted in a multitude of resource usage
problems. One especially serious difficulty concerns waste removal. The article “Innovative Sludge
Handling Through Pelletization Thickening” (Water Res. 1999: 3245-3252) reported the development
of a new compression machine for processing sewage sludge. An important part of the
investigation involved relating the moisture content of compressed pellets (y, in %) to the machine’s
filtration rate (x, in kg-DS/m/h). The following data was read from a graph in the paper:

x   125.3   98.2  201.4  147.3  145.9  124.7  112.2  120.2  161.2  178.9
y    77.9   76.8   81.5   79.8   78.2   78.3   77.5   77.0   80.1   80.2
x   159.5  145.8   75.1  151.4  144.2  125.0  198.8  132.5  159.6  110.7
y    79.9   79.0   76.7   78.2   79.5   78.1   81.5   77.0   79.0   78.6


Relevant summary quantities are $\bar{x} = 140.895$, $\bar{y} = 78.74$, $S_{xx} = 18{,}921.8295$, and $S_{xy} = 776.434$,
from which

$$\hat{\beta}_1 = \frac{776.434}{18{,}921.8295} = .04103377 \approx .041$$

$$\hat{\beta}_0 = 78.74 - (.04103377)(140.895) = 72.958547 \approx 72.96$$

The equation of the least squares line is $y = 72.96 + .041x$. For numerical accuracy, the fitted values
are calculated from $\hat{y}_i = 72.958547 + .04103377x_i$:

$$\hat{y}_1 = 72.958547 + .04103377(125.3) \approx 78.100, \quad e_1 = y_1 - \hat{y}_1 \approx -.200, \text{ etc.}$$


A positive residual corresponds to a point in the scatterplot that lies above the graph of the least 
squares line, whereas a negative residual results from a point lying below the line. All predicted 
values (fits) and residuals appear in the accompanying table. 


Obs. (i)   Filtrate ($x_i$)   Moist. Con. ($y_i$)   Fit ($\hat{y}_i$)   Residual ($e_i$)
 1         125.3              77.9                  78.100             −0.200
 2          98.2              76.8                  76.988             −0.188
 3         201.4              81.5                  81.223              0.277
 4         147.3              79.8                  79.003              0.797
 5         145.9              78.2                  78.945             −0.745
 6         124.7              78.3                  78.075              0.225
 7         112.2              77.5                  77.563             −0.063
 8         120.2              77.0                  77.891             −0.891
 9         161.2              80.1                  79.573              0.527
10         178.9              80.2                  80.299             −0.099
11         159.5              79.9                  79.503              0.397
12         145.8              79.0                  78.941              0.059
13          75.1              76.7                  76.040              0.660
14         151.4              78.2                  79.171             −0.971
15         144.2              79.5                  78.876              0.624
16         125.0              78.1                  78.088              0.012
17         198.8              81.5                  81.116              0.384
18         132.5              77.0                  78.396             −1.396
19         159.6              79.0                  79.508             −0.508
20         110.7              78.6                  77.501              1.099

It can be verified that, rounding error notwithstanding, the residuals (the last column) sum to 0.
The corresponding residual sum of squares is

$$\text{SSE} = (-.200)^2 + (-.188)^2 + \cdots + (1.099)^2 = 7.968$$

The estimate of $\sigma^2$ is then $\hat{\sigma}^2 = s_e^2 = 7.968/(20 - 2) = .4427$, and the residual standard deviation is
$\hat{\sigma} = s_e = \sqrt{.4427} = .665$. Roughly speaking, .665 is the typical difference between the actual
moisture concentration of a specimen and its predicted moisture concentration based on the LSRL.

Computation of SSE from the defining formula involves much tedious arithmetic, because both the
predicted values and residuals must first be calculated. Use of the following formula does not require
these quantities (again see Exercise 24), though $\sum y_i$ and $\sum y_i^2$ are needed.

$$\text{SSE} = S_{yy} - S_{xy}^2/S_{xx}$$
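
To check that the defining formula and the shortcut formula for SSE agree, both can be computed for the filtration data of Example 12.6. This is a minimal Python sketch under the assumption that the data are exactly the 20 (x, y) pairs tabulated in the example; the variable names are chosen here for illustration.

```python
import math

# Filtration rate (x) and moisture content (y) from Example 12.6.
x = [125.3, 98.2, 201.4, 147.3, 145.9, 124.7, 112.2, 120.2, 161.2, 178.9,
     159.5, 145.8, 75.1, 151.4, 144.2, 125.0, 198.8, 132.5, 159.6, 110.7]
y = [77.9, 76.8, 81.5, 79.8, 78.2, 78.3, 77.5, 77.0, 80.1, 80.2,
     79.9, 79.0, 76.7, 78.2, 79.5, 78.1, 81.5, 77.0, 79.0, 78.6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals: observed value minus fitted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse_direct = sum(e ** 2 for e in residuals)   # defining formula
sse_shortcut = s_yy - s_xy ** 2 / s_xx        # shortcut formula
s_e = math.sqrt(sse_direct / (n - 2))         # residual standard deviation

print(f"sum of residuals = {sum(residuals):.2e}")
print(f"SSE (direct) = {sse_direct:.3f}, SSE (shortcut) = {sse_shortcut:.3f}")
print(f"s_e = {s_e:.3f}")
```

Because the code carries full precision, the residuals sum to zero up to floating-point error rather than up to the rounding error visible in the printed table.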


The Coefficient of Determination 

Figure 12.10 shows three different scatterplots of bivariate data. In all three plots, the heights of the 
different points vary substantially, indicating that there is much variability in observed y values. The 
points in the first plot all fall exactly on a straight line. In this case, all (100%) of the sample variation 
in y can be attributed to the fact that x and y are linearly related in combination with variation in x. The 
points in Figure 12.10b do not fall exactly on a line, but compared to overall y variability, the 
deviations from the least squares line are small. It is reasonable to conclude in this case that much of 
the observed y variation can be attributed to the approximate linear relationship between the variables 
postulated by the simple linear regression model. When the scatterplot looks like that of 
Figure 12.10c, there is substantial variation about the least squares line relative to overall y variation, 
so the simple linear regression model fails to explain much of the variation in y by relating y to x. 


Figure 12.10 Explaining y variation: (a) all variation explained; (b) most variation explained;
(c) little variation explained


The error sum of squares SSE can be interpreted as a measure of how much variation in y is left
unexplained by the model, that is, how much cannot be attributed to a linear relationship. In
Figure 12.10a, SSE = 0 and there is no unexplained variation, whereas unexplained variation is small
for the data of Figure 12.10b and much larger in Figure 12.10c. A quantitative measure of the total
amount of variation in the observed y values is given by the total sum of squares

$$\text{SST} = \sum (y_i - \bar{y})^2 = S_{yy}$$

Figure 12.11 illustrates the difference between these two sums of squares. The (x, y) points in the two
scatterplots are identical. While SSE measures the deviation of the y values from the LSRL (a
“model” that uses x as a predictor), SST measures the deviation of the y values from the horizontal
line $y = \bar{y}$ (essentially ignoring the presence of x). Since the least squares line is by definition the line
having the smallest sum of squared vertical deviations, SSE can’t be any larger than SST, and usually
it is much smaller.


Figure 12.11 Sums of squares illustrated: (a) SSE = sum of squared deviations about the least squares line;
(b) SST = sum of squared deviations about the horizontal line $y = \bar{y}$


Dividing SSE by SST gives the proportion of total variation that is not explained by the 
approximate linear relationship. Subtracting this ratio from 1 results in the proportion of total vari- 
ation that is explained by the relationship. 


DEFINITION The coefficient of determination, denoted by $R^2$, is given by

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$

$R^2$ is interpreted as the proportion of observed y variation that can be explained by
the simple linear regression model (i.e., attributed to an approximate linear relationship
between y and x).


The closer $R^2$ is to 1, the more successful the simple linear regression model is in explaining
y variation. Multiplying $R^2$ by 100 gives the percentage of total variation explained by the relationship;
software often reports $R^2$ this way.

Said differently, $R^2$ is the proportion by which the error sum of squares is reduced by the
regression line compared to the horizontal line. For example, if SST = 20 and SSE = 2, then
$R^2 = 1 - (2/20) = .9$, so the regression reduces the error sum of squares by 90%.

Although it is common to have $R^2$ values of .9 or more in engineering and the physical sciences,
$R^2$ is likely to be much smaller in social sciences such as psychology and sociology, where values far
less than .5 are common but still considered important.

Example 12.7 (Example 12.5 continued) The scatterplot of the truss height-sale price data in
Figure 12.8 indicates a fairly high $R^2$ value. Previous computations showed that SST = $S_{yy}$ =
924.436, $S_{xy}$ = 901.944, and $S_{xx}$ = 913.684. Using the computational shortcut,

$$\text{SSE} = 924.436 - (901.944)^2/913.684 = 34.081$$

The coefficient of determination is then

$$R^2 = 1 - \frac{34.081}{924.436} = 1 - .037 = .963$$

That is, 96.3% of the observed variation in warehouse price is attributable to (can be explained by) the
approximate linear relationship between price and truss height, a fairly impressive result. The $R^2$ can
also be interpreted by saying that the error sum of squares using the regression line is 96.3% less than
the error sum of squares using a horizontal line (i.e., ignoring truss height).

Figure 12.12 shows partial Minitab output for the warehouse data; the package will also provide
the predicted values and residuals upon request, as well as other information. The formats used by
other packages differ slightly from that of Minitab, but the information content is very similar.
Quantities in Figure 12.12 such as the standard deviations, t ratios, and the details of the ANOVA
table are discussed in Section 12.3.


The regression equation is
Sales Price = 23.8 + 0.987 Truss Height

Predictor       Coef                           SE Coef    T       P
Constant        23.772   ← $\hat{\beta}_0$     1.113      21.35   0.000
Truss Height    0.98715  ← $\hat{\beta}_1$     0.04684    21.07   0.000

S = 1.41590 ← $s_e$     R-Sq = 96.3% ← $100R^2$     R-Sq(adj) = 96.1%

Analysis of Variance

Source            DF    SS               MS        F         P
Regression         1    890.36           890.36    444.11    0.000
Residual Error    17     34.08 ← SSE       2.00
Total             18    924.44 ← SST


Figure 12.12 Minitab output for the regression of Example 12.7


For regression there is an analysis of variance identity like the fundamental identity (11.1) in
Chapter 11. Add and subtract $\hat{y}_i$ in the total sum of squares:

$$\text{SST} = \sum (y_i - \bar{y})^2 = \sum [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

Notice that the middle (cross-product) term is missing on the right; see Exercise 24 for the justification.
Of the two sums on the right, the first is SSE = $\sum (y_i - \hat{y}_i)^2$ and the second is something new, the
regression sum of squares, SSR = $\sum (\hat{y}_i - \bar{y})^2$. The analysis of variance identity for regression is

$$\text{SST} = \text{SSE} + \text{SSR} \tag{12.5}$$

The coefficient of determination can now be written in a slightly different way:

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SST} - \text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}$$

The ANOVA table in Figure 12.12 shows that SSR = 890.36, from which $R^2 = 890.36/924.44
= .963$ as before. Hence we interpret the regression sum of squares SSR as the amount of total
variation explained by the model, so that $R^2$ is the ratio of explained variation to total variation.
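
The analysis of variance identity and the two equivalent expressions for the coefficient of determination can be checked numerically on the warehouse data. The following Python sketch is an illustration only (it is not how Minitab computes its table); it assumes the 19 observations of Example 12.4 and computes each sum of squares from its definition.

```python
# Truss height (ft) and sale price ($ per square foot), Example 12.4.
x = [12, 14, 14, 15, 15, 16, 18, 22, 22, 24,
     24, 26, 26, 27, 28, 30, 30, 33, 36]
y = [35.53, 37.82, 36.90, 40.00, 38.00, 37.50, 41.00, 48.50, 47.00, 47.50,
     46.20, 50.35, 49.13, 48.07, 50.90, 54.78, 54.32, 57.17, 57.45]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]

# The three sums of squares, each from its definition.
sst = sum((yi - y_bar) ** 2 for yi in y)                # total
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error (unexplained)
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression (explained)

r_squared = 1 - sse / sst

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print(f"SSE + SSR = {sse + ssr:.3f}")
print(f"R^2 = {r_squared:.3f} (also SSR/SST = {ssr / sst:.3f})")
```

The identity holds only for the least squares line; for any other line the cross-product term does not vanish and SSE + SSR need not equal SST.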


Exercises: Section 12.2 (13-30) 


13. Exercise 4 gave data on x = BOD mass
loading and y = BOD mass removal.
Values of relevant summary quantities are

n = 14   $\sum x_i = 517$   $\sum y_i = 346$
Sy Res Bye eae

a. Obtain the equation of the least squares
line.
b. Predict the value of BOD mass removal
for a single observation made when
BOD mass loading is 35, and calculate
the value of the corresponding residual.
c. Calculate SSE and then a point estimate
of $\sigma$.
d. What proportion of observed variation
in removal can be explained by the
approximate linear relationship
between the two variables?
e. The last two x values, 103 and 142, are
much larger than the others. How are
the equation of the least squares line
and the value of $R^2$ affected by deletion
of the two corresponding observations
from the sample? Adjust the given
values of the summary quantities, and
use the fact that the new value of SSE
is 311.79.


14. The accompanying data on x = current
density (mA/cm²) and y = rate of deposition
(mm/min) appeared in the article
“Plating of 60/40 Tin/Lead Solder for Head
Termination Metallurgy” (Plating and
Surface Finishing, Jan. 1997: 38-40). Do
you agree with the claim by the article’s
author that “a linear relationship was
obtained from the tin-lead rate of deposition
as a function of current density”?
Explain your reasoning.

x 20 40 60 80
5 4 1.20 71 2.29


15. The efficiency ratio for a steel specimen
immersed in a phosphating tank is the
weight of the phosphate coating divided by
the metal loss (both in mg/ft²). The article
“Statistical Process Control of a Phosphate
Coating Line” (Wire J. Internat., May 1997:
78-81) gave the accompanying data on tank
temperature (x) and efficiency ratio (y).

Temp. | 170  172  173  174  174  175  176
Ratio | .84  1.31 1.42 1.03 1.07 1.08 1.04

Temp. | 177  180  180  180  180  180  181
Ratio | 1.80 1.45 1.60 1.61 2.13 2.15 .84

Temp. | 181  182  182  182  182  184  184
Ratio | 1.43 .90  1.81 1.94 2.68 1.49 2.52

Temp. | 185  186  188
Ratio | 3.00 1.87 3.08

a. Determine the equation of the estimated
regression line.

b. Calculate a point estimate for true
average efficiency ratio when tank
temperature is 182.

c. Calculate the values of the residuals
from the least squares line for the four
observations for which temperature is
182. Why do they not all have the same
sign?

d. What proportion of the observed variation
in efficiency ratio can be attributed
to the simple linear regression
relationship between the two variables?

16. The scientist Francis Galton, an early 
developer of regression methodology, used 
“midparent height,” the average of the 
father’s and mother’s heights, in order to 
predict children’s heights. Here are the 
heights of 11 female students along with 
their midparent heights in inches: 


Midparent   66.0  65.5  71.5  68.0  70.0  65.5
Daughter    64.0  63.0  69.0  69.0  69.0  65.0

Midparent   67.0  70.5  69.5  64.5  67.5
Daughter    63.0  68.5  69.0  64.0  67.0


a. Construct a scatterplot of daughter’s 
height against the midparent height and 
comment on the strength of the 
relationship. 

b. Is the daughter’s height completely and 
uniquely determined by the midparent 
height? Explain. 

c. Use the accompanying Minitab output 
to obtain the equation of the least 
squares line for predicting daughter 
height from midparent height, and then 
predict the height of a daughter whose 
midparent height is 70 in. Would you 
feel comfortable using the least squares 
line to predict daughter height when 
midparent height is 74 in.? Explain. 


Predictor    Coef      SE Coef   T      P
Constant     1.65      13.36     0.12   0.904
midparent    0.9555    0.1971    4.85   0.001

S = 1.45061   R-Sq = 72.3%   R-Sq(adj) = 69.2%

Analysis of Variance

Source            DF   SS       MS       F       P
Regression         1   49.471   49.471   23.51   0.001
Residual Error     9   18.938    2.104
Total             10   68.409


d. What are the values of SSE, SST, and the 
coefficient of determination? How well 
does the midparent height account for the 
variation in daughter height? 


17. The article “Characterization of Highway Runoff in Austin, Texas, Area”
(J. Environ. Engr. 1998: 131-137) gave a scatterplot, along with the least squares
line, of x = rainfall volume (m³) and y = runoff volume (m³) for a particular
location. The accompanying values were read from the plot.

x     5   12   14   17   23   30   40   47
y     4   10   13   15   15   25   27   46

x    55   67   72   81   96  112  127
y    38   46   53   70   82   99  100


a. Does a scatterplot of the data support 
the use of the simple linear regression 
model? 

b. Calculate point estimates of the slope 
and intercept of the population regres- 
sion line. 

c. Calculate a point estimate of the true 
average runoff volume when rainfall 
volume is 50. 

d. Calculate a point estimate of the standard deviation $\sigma$.

e. What proportion of the observed 
variation in runoff volume can be 
attributed to the simple linear regression 
relationship between runoff and rainfall? 


18. A regression of y = calcium content (g/L) on x = dissolved material (mg/cm²) was
reported in the article “Use of Fly Ash or Silica Fume to Increase the Resistance of
Concrete to Feed Acids” (Mag. Concrete Res. 1997: 337-344). The equation of the
estimated regression line was y = 3.678 + .144x, with $R^2$ = .860, based on n = 23.

a. Interpret the estimated slope .144 and the coefficient of determination .860.

b. Calculate a point estimate of the true average calcium content when the
amount of dissolved material is 50 mg/cm².

c. The value of total sum of squares was SST = 320.398. Calculate an estimate
of the error standard deviation $\sigma$ in the simple linear regression model.


19. The cetane number is a critical property in specifying the ignition quality of a fuel
used in a diesel engine. Determination of this number for a biodiesel fuel is expensive
and time-consuming. The article “Relating the Cetane Number of Biodiesel Fuels to Their
Fatty Acid Composition: A Critical Study” (J. Automobile Engr. 2009: 565-583) included
the following data on x = iodine value (g) and y = cetane number for a sample of
14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of
100 g of oil. The article’s authors fit the simple linear regression model to this data,
so let’s follow their lead.


x   132.0  129.0  120.0  113.2  105.0   92.0   84.0
y    46.0   48.0   51.0   52.1   54.0   52.0   59.0

x    83.2   88.4   59.0   80.0   81.5   71.0   69.2
y    58.7   61.6   64.0   61.4   54.6   58.8   58.0


a. Obtain the equation of the least squares line, and then calculate a point
prediction of the cetane number that would result from a single observation with an
iodine value of 100.

b. Calculate and interpret the coefficient of determination.

c. Calculate and interpret a point estimate of the model standard deviation $\sigma$.


20. A number of studies have shown lichens (certain plants composed of an alga and a
fungus) to be excellent bioindicators of air pollution. The article “The Epiphytic
Lichen Hypogymnia physodes as a Biomonitor of Atmospheric Nitrogen and
Sulphur Deposition in Norway” (Environ. Monit. Assess. 1993: 27-47) gives the
following data (read from a graph) on x = NO₃⁻ wet deposition (g N/m²) and
y = lichen N (% dry weight):


x   .05  .10  .11  .12  .31  .37   .42
y   .48  .55  .48  .50  .58  .52  1.02

x   .58  .68   .68  .73   .85   .92
y   .86  .86  1.00  .88  1.04  1.70


The author used simple linear regression to analyze the data. Use the accompanying
Minitab output to answer the following questions:

a. What are the least squares estimates of $\beta_0$ and $\beta_1$?

b. Predict lichen N for an NO₃⁻ deposition value of .5.

c. What is the estimate of $\sigma$?

d. What is the value of total variation, and how much of it can be explained by the
model relationship?


The regression equation is
lichen N = 0.365 + 0.967 No3depo

Predictor   Coef      Stdev     t ratio   P
Constant    0.36510   0.09904   3.69      0.004
No3depo     0.9668    0.1829    5.29      0.000

S = 0.1932   R-sq = 71.7%   R-sq(adj) = 69.2%

Analysis of Variance

Source       DF   SS       MS       F       P
Regression    1   1.0427   1.0427   27.94   0.000
Error        11   0.4106   0.0373
Total        12   1.4533


21. Visual and musculoskeletal problems associated with the use of visual display
terminals (VDTs) have become rather common in recent years. Some researchers
have focused on vertical gaze direction as a source of eye strain and irritation. This
direction is known to be closely related to ocular surface area (OSA), so a method of
measuring OSA is needed. The accompanying representative data on y = OSA
(cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye
opening, in cm) is from the article “Analysis of Ocular Surface Area for Comfortable
VDT Workstation Layout” (Ergonomics 1996: 877-884).


x    .40   .42   .48   .51   .57   .60   .70   .75
y   1.02  1.21   .88   .98  1.52  1.83  1.50  1.80

x    .75   .78   .84   .95   .99  1.03  1.12
y   1.74  1.63  2.00  2.80  2.48  2.47  3.05

x   1.15  1.20  1.25  1.25  1.28  1.30  1.34  1.37
y   3.18  3.76  3.68  3.82  3.21  4.27  3.12  3.99

x   1.40  1.43  1.46  1.49  1.55  1.58  1.60
y   3.75  4.10  4.18  3.77  4.34  4.21  4.92


a. Construct a scatterplot of this data. Describe what you find.

b. Calculate the equation of the LSRL.

c. Interpret the slope of the LSRL.

d. What OSA would you predict for a subject whose palpebral fissure width is
1.25 cm?

e. What would be the estimate of expected OSA for people with palpebral fissure
width of 1.25 cm?


22. For many years, rubber powder has been used in asphalt cement to improve
performance. The article “Experimental Study of Recycled Rubber-Filled High-Strength
Concrete” (Mag. Concrete Res. 2009: 549-556) included a regression of y = axial
strength (MPa) on x = cube strength (MPa) based on the following sample data:


x   112.3   97.0   92.7   86.0  102.0
y    75.0   71.0   57.7   48.7   74.3

x    99.2   95.8  103.5   89.0   86.7
y    73.3   68.0   59.3   57.8   48.5


a. Verify that a scatterplot supports the assumption that the two variables are
related via the simple linear regression model.

b. Obtain the equation of the least squares line, and interpret its slope.

c. Calculate and interpret the coefficient of determination.

d. Calculate and interpret an estimate of the error standard deviation $\sigma$ in the
simple linear regression model.

e. The largest x value in the sample considerably exceeds the other x values.
What is the effect on the equation of the least squares line of deleting
the corresponding observation?


23. Show that under the assumptions of the simple linear regression model, the mles of
$\beta_0$ and $\beta_1$ are identical to the least squares estimates. [Hints: (1) The pdf of $Y_i$ is
normal with mean $\mu_i = \beta_0 + \beta_1 x_i$ and variance $\sigma^2$; the likelihood function is the
product of the n pdfs. (2) You don’t need to differentiate the likelihood function; instead,
find the correspondence between that function and the least squares expression $g(b_0, b_1)$.]


24. a. Show that the residuals $e_1, \ldots, e_n$ satisfy both $\sum e_i = 0$ and $\sum (x_i - \bar x)e_i = 0$.
[Hint: Use the last expression for $e_i$ in (12.4), along with the fact that for any
numbers $a_1, \ldots, a_n$, $\sum (a_i - \bar a) = 0$.]

b. Show that $\hat y_i - \bar y = \hat\beta_1 (x_i - \bar x)$.

c. Use (a) and (b) to derive the analysis of variance identity for regression,
Equation (12.5), by showing that the cross-product term is 0.

d. Use (b) and Equation (12.5) to verify the computational formula for SSE.


25. A regression analysis is carried out with y = temperature, expressed in °C. How do
the resulting values of $\hat\beta_0$ and $\hat\beta_1$ relate to those obtained if y is re-expressed in °F?
Justify your assertion. [Hint: (new $y_i$) = $1.8y_i + 32$.]


26. Show that $b_1$ and $b_0$ of Expressions (12.2) and (12.3) satisfy the normal equations.


27. Show that the “point of averages” $(\bar x, \bar y)$ lies on the estimated regression line.


28. Suppose an investigator has data on the amount of shelf space x devoted to display
of a particular product and sales revenue y for that product. The investigator may
wish to fit a model for which the true regression line passes through (0, 0). The
appropriate model is $Y = \beta_1 x + \varepsilon$. Assume that $(x_1, y_1), \ldots, (x_n, y_n)$ are observed pairs
generated from this model, and derive the least squares estimator of $\beta_1$. [Hint: Write
the sum of squared deviations as a function of $b_1$, a trial value, and use calculus to find
the minimizing value of $b_1$.]


29. a. Consider the data in Exercise 20. Suppose that instead of the least squares
line that passes through the points $(x_1, y_1), \ldots, (x_n, y_n)$, we wish the least
squares line that passes through $(x_1 - \bar x, y_1), \ldots, (x_n - \bar x, y_n)$. Construct
a scatterplot of the $(x_i, y_i)$ points and then of the $(x_i - \bar x, y_i)$ points. Use the
plots to explain intuitively how the two least squares lines are related.

b. Suppose that instead of the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ $(i = 1, \ldots, n)$, we
wish to fit a model of the form $Y_i = \beta_0^* + \beta_1^*(x_i - \bar x) + \varepsilon_i$ $(i = 1, \ldots, n)$.
What are the least squares estimators of $\beta_0^*$ and $\beta_1^*$, and how do they relate to
$\hat\beta_0$ and $\hat\beta_1$?


30. Consider the following three data sets, in which the variables of interest are
x = commuting distance and y = commuting time. Based on a scatterplot and the
values of $s_e$ and $R^2$, in which situation would simple linear regression be most
(least) effective, and why?


                  Data set 1      Data set 2      Data set 3
                   x     y         x     y         x     y
                  15    42         5    16         5     8
                  16    35        10    32        10    16
                  17    45        15    44        15    22
                  18    42        20    45        20    23
                  19    49        25    63        25    31
                  20    46        50   115        50    60

$S_{xx}$          17.50           1270.8333       1270.8333
$S_{xy}$          29.50           2722.5          1431.6667
$\hat\beta_1$     1.685714        2.142295        1.126557
$\hat\beta_0$     13.666672       7.868852        3.196729
SST               114.83          5897.5          1627.33
SSE               65.10           65.10           14.48


12.3 Inferences About the Regression Coefficient $\beta_1$


In virtually all of our inferential work thus far, the notion of sampling variability has been pervasive. 
In particular, properties of sampling distributions of various statistics ($\bar X$, $\hat P$, and so on) have been the
basis for developing confidence interval formulas and hypothesis-testing methods. The key idea is 
that the value of virtually any quantity calculated from sample data (i.e., any statistic) is going to vary 
from one sample to another. 


Example 12.8 Reconsider the data on x = truss height and y = sale price per square foot from
n = 19 warehouses in Example 12.4 of the previous section. Suppose the simple linear regression
model applies here, with parameter values $\beta_1 = 1$, $\beta_0 = 25$, and $\sigma = 1.4$ (consistent with the estimates
computed previously). To understand the sampling variability of the statistics $\hat\beta_0$ and $\hat\beta_1$, we
performed the following simulation 250 times in R:


• Generate random errors $\varepsilon_1, \ldots, \varepsilon_{19}$ from a normal distribution with mean 0 and standard deviation
$\sigma = 1.4$.

• Using the 19 $x_i$’s from the original data set, generate response values according to the model
equation:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = 25 + 1x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, 19$$

• Perform least squares regression on the simulated $(x_i, y_i)$ pairs to obtain the estimated slope and
intercept.


Figure 12.13 shows histograms of the $\hat\beta_0$ and $\hat\beta_1$ values resulting from this simulation. There is
clearly variation in values of the estimated slope and estimated intercept. The equation of the LSRL
thus also varies from one sample to the next. Note, though, that the estimates are centered close to the
true values, an indication of unbiasedness.

Figure 12.13 Histograms approximating the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$
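
The three simulation steps can be sketched in code. The example was run in R; the version below is a minimal Python sketch using only the standard library. The 19 truss heights of Example 12.4 are not reproduced in this section, so the list `X` holds hypothetical stand-in values, and the seed and helper names are likewise illustrative assumptions.

```python
import random
import statistics

# Hypothetical truss heights: the 19 x-values of Example 12.4 are not
# reproduced in this section, so these are stand-ins for illustration.
X = [12, 14, 14, 15, 15, 16, 18, 22, 22, 24, 24, 26, 26, 27, 28, 30, 30, 33, 36]
BETA0, BETA1, SIGMA = 25.0, 1.0, 1.4   # parameter values assumed in the example


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def simulate(reps=250, seed=1):
    """Repeat the three simulation steps `reps` times."""
    rng = random.Random(seed)
    intercepts, slopes = [], []
    for _ in range(reps):
        # Steps 1 and 2: normal errors added to the true regression line
        y = [BETA0 + BETA1 * xi + rng.gauss(0, SIGMA) for xi in X]
        # Step 3: least squares fit to the simulated pairs
        b0, b1 = least_squares(X, y)
        intercepts.append(b0)
        slopes.append(b1)
    return intercepts, slopes


intercepts, slopes = simulate()
print("mean of slopes:    ", round(statistics.fmean(slopes), 3))
print("sd of slopes:      ", round(statistics.stdev(slopes), 3))
print("mean of intercepts:", round(statistics.fmean(intercepts), 2))
```

The averages of the simulated slopes and intercepts should land near 1 and 25, which is the unbiasedness visible in the histograms.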


The slope $\beta_1$ of the population regression line is the true change in the mean of the response
variable Y associated with a one-unit increase in the explanatory variable x. The slope of the least
squares line, $\hat\beta_1$, gives a point estimate of $\beta_1$. In the same way that a confidence interval for $\mu$ and
procedures for testing hypotheses about $\mu$ were based on properties of the sampling distribution of $\bar X$,
inferences about $\beta_1$ are based on the sampling distribution of $\hat\beta_1$.

The values of the $x_i$’s are assumed to be chosen before the study is performed, so only the $Y_i$’s are
random. The estimators for $\beta_0$ and $\beta_1$ are obtained by replacing $y_i$ with $Y_i$ in (12.2) and (12.3):


$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(Y_i - \bar Y)}{\sum (x_i - \bar x)^2} = \frac{S_{xY}}{S_{xx}}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar x$$


Similarly, the estimator for $\sigma$ results from replacing each $y_i$ in the formula for $s_e$ by the rv $Y_i$:


$$\hat\sigma = S_e = \sqrt{\frac{\sum (Y_i - \hat Y_i)^2}{n-2}} = \sqrt{\frac{\sum (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{n-2}}$$


The denominator of $\hat\beta_1$, $S_{xx} = \sum (x_i - \bar x)^2$, depends only on the $x_i$’s and not on the $Y_i$’s, so it is a
constant. Then, because $\sum [(x_i - \bar x)\bar Y] = \bar Y \sum (x_i - \bar x) = \bar Y \cdot 0 = 0$, the slope estimator can be
re-written as

$$\hat\beta_1 = \frac{\sum (x_i - \bar x) Y_i}{S_{xx}} = \sum c_i Y_i \qquad \text{where } c_i = (x_i - \bar x)/S_{xx}$$


That is, the estimator $\hat\beta_1$ is a linear function of the independent rvs $Y_1, Y_2, \ldots, Y_n$, each of which is
normally distributed. Invoking properties of a linear function of random variables discussed in
Section 5.3 leads to the following results (Exercise 40).
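
A quick numerical check of the linear-combination form may help. The sketch below uses a small hypothetical data set (not from the text) to confirm that the sum of $c_i y_i$ reproduces the usual slope formula, and that the weights satisfy $\sum c_i = 0$, $\sum c_i x_i = 1$, and $\sum c_i^2 = 1/S_{xx}$, the three identities that drive the mean and variance of the slope estimator.

```python
import statistics

# Hypothetical (x, y) values, used only to check the algebra numerically.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.1, 4.9, 6.2, 8.1, 9.8]

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)

# Usual form of the slope: S_xy / S_xx
slope_usual = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

# Linear-combination form: sum of c_i * y_i with c_i = (x_i - xbar) / S_xx
c = [(xi - xbar) / sxx for xi in x]
slope_linear = sum(ci * yi for ci, yi in zip(c, y))

sum_c = sum(c)                                    # should be 0
sum_cx = sum(ci * xi for ci, xi in zip(c, x))     # should be 1
sum_c2_times_sxx = sum(ci ** 2 for ci in c) * sxx  # should be 1

print(slope_usual, slope_linear)
print(sum_c, sum_cx, sum_c2_times_sxx)
```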


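PROPERTIES OF THE ESTIMATED SLOPE

1. The mean value of $\hat\beta_1$ is $E(\hat\beta_1) = \mu_{\hat\beta_1} = \beta_1$, so $\hat\beta_1$ is an unbiased
estimator of $\beta_1$ (i.e., the distribution of $\hat\beta_1$ is always centered at the
true value of $\beta_1$).


2. The variance and standard deviation of $\hat\beta_1$ are

$$V(\hat\beta_1) = \sigma^2_{\hat\beta_1} = \frac{\sigma^2}{S_{xx}}, \qquad \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{S_{xx}}}$$

Replacing $\sigma$ by its estimate $s_e$ gives an estimate for $\sigma_{\hat\beta_1}$:

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{S_{xx}}} = \frac{s_e}{s_x\sqrt{n-1}}$$


3. The estimator $\hat\beta_1$ has a normal distribution, because it is a linear
function of independent normal rvs.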


Properties 1 and 3 manifest themselves in the $\hat\beta_1$ histogram of Figure 12.13. According to Property 2,
the standard deviation of $\hat\beta_1$ equals the standard deviation $\sigma$ of the random error term (or, equivalently,
of any $Y_i$) divided by $s_x\sqrt{n-1}$. Because the sample standard deviation $s_x$ is a measure of how
spread out the $x_i$’s are about $\bar x$, we conclude that making observations at $x_i$ values that are quite spread
out results in a more precise estimator of the slope parameter (smaller variance of $\hat\beta_1$), whereas values
of $x_i$ all close to each other imply a highly variable estimator. Of course, if the $x_i$’s are spread out too
far, a linear model may not be appropriate throughout the range of observation. Finally, the presence
of n in the denominator of $s_{\hat\beta_1}$ implies that the estimated slope varies less for larger samples than for
smaller ones. We have seen this feature previously in other statistics such as $\bar X$ and $\hat P$: as sample size
n increases, the distribution of the statistic “collapses onto” the true value of the corresponding
parameter.
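
To make the effect of design and sample size concrete, the sketch below evaluates $\sigma/\sqrt{S_{xx}}$ for three hypothetical sets of x values. The designs and the value $\sigma = 2$ are invented for illustration and do not come from the text.

```python
import math
import statistics


def slope_sd(x, sigma):
    """Standard deviation of the slope estimator: sigma / sqrt(S_xx)."""
    xbar = statistics.fmean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sigma / math.sqrt(sxx)


sigma = 2.0                        # hypothetical error standard deviation
clustered = [48, 49, 50, 51, 52]   # x values bunched together
spread = [10, 30, 50, 70, 90]      # same n, x values spread out
spread_twice = spread * 2          # same spread, each x observed twice

sd_clustered = slope_sd(clustered, sigma)
sd_spread = slope_sd(spread, sigma)
sd_spread_twice = slope_sd(spread_twice, sigma)

print(round(sd_clustered, 4), round(sd_spread, 4), round(sd_spread_twice, 4))
```

Spreading the same five observations over a range 20 times as wide cuts the standard deviation of the slope by a factor of 20, while doubling the sample size at the same x values cuts it by a further factor of $\sqrt 2$.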

Many inferential procedures discussed previously were based on standardizing an estimator by
first subtracting its mean value and then dividing by its estimated standard deviation. In particular, test
procedures and a CI for the mean $\mu$ of a normal population utilized the fact that the standardized
variable $(\bar X - \mu)/(S/\sqrt{n})$ has a t distribution with $n - 1$ df. A similar result here provides the key to
further inferences concerning $\beta_1$.




THEOREM The assumptions of the simple linear regression model imply that the standardized
variable

$$T = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{\hat\beta_1 - \beta_1}{S_e/\sqrt{S_{xx}}}$$

has a t distribution with $n - 2$ df.


The T ratio can be written as

$$T = \frac{(\hat\beta_1 - \beta_1)/(\sigma/\sqrt{S_{xx}})}{\sqrt{\dfrac{(n-2)S_e^2/\sigma^2}{n-2}}}$$


The theorem is a consequence of the following facts: (1) $(\hat\beta_1 - \beta_1)/(\sigma/\sqrt{S_{xx}}) \sim N(0, 1)$,
(2) $(n - 2)S_e^2/\sigma^2 \sim \chi^2_{n-2}$, and (3) $\hat\beta_1$ is independent of $S_e$. That is, T is a standard normal rv divided by
the square root of an independent chi-squared rv over its df, and so has the specified t distribution.


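A Confidence Interval for $\beta_1$
As in the derivation of previous CIs, we begin with a probability statement:


$$P\left(-t_{\alpha/2,n-2} < \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} < t_{\alpha/2,n-2}\right) = 1 - \alpha$$


Manipulation of the inequalities inside the parentheses to isolate $\beta_1$ and substitution of estimates in
place of the estimators gives the following CI formula.


A $100(1 - \alpha)\%$ CI for the slope $\beta_1$ of the true regression line has endpoints


$$\hat\beta_1 \pm t_{\alpha/2,n-2} \cdot s_{\hat\beta_1} = \hat\beta_1 \pm t_{\alpha/2,n-2} \cdot \frac{s_e}{\sqrt{S_{xx}}} = \hat\beta_1 \pm t_{\alpha/2,n-2} \cdot \frac{s_e}{s_x\sqrt{n-1}}$$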
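
The manipulation that isolates $\beta_1$ can be spelled out in three steps. Here $t^*$ is shorthand, introduced only for this display, for the critical value $t_{\alpha/2,n-2}$; the steps rely on $S_{\hat\beta_1} > 0$.

```latex
\begin{align*}
-t^* < \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} < t^*
  &\iff -t^* S_{\hat\beta_1} < \hat\beta_1 - \beta_1 < t^* S_{\hat\beta_1} \\
  &\iff -\hat\beta_1 - t^* S_{\hat\beta_1} < -\beta_1 < -\hat\beta_1 + t^* S_{\hat\beta_1} \\
  &\iff \hat\beta_1 - t^* S_{\hat\beta_1} < \beta_1 < \hat\beta_1 + t^* S_{\hat\beta_1}
\end{align*}
```

Multiplying through by $-1$ in the last step reverses both inequalities, which is why the endpoints end up symmetric about $\hat\beta_1$.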


This interval has the same general form as did many of our previous intervals. It is centered at the
point estimate of the parameter, and the amount it extends out to each side of the estimate depends on
the desired confidence level (through the t critical value) and on the amount of variability in the
estimator $\hat\beta_1$ (through $s_{\hat\beta_1}$, which will tend to be small when there is little variability in the distribution
of $\hat\beta_1$ and large otherwise).


Example 12.9 The scatterplot in Figure 12.14 shows the size (x, in square feet) and monthly rent (y,
in dollars) for a random sample of n = 77 two-bedroom apartments in Omaha, NE (courtesy of
www.zillow.com/omaha-ne/rentals). The plot suggests, not surprisingly, that rent generally increases with
apartment size, and that for any fixed apartment size there is variability in monthly rents.

Figure 12.14 Scatterplot of the data from Example 12.9


Summary quantities include


$\bar x$ = 1023.5     $s_x$ = 161.9     $S_{xx}$ = 1,991,569

$\bar y$ = 1006.6     $s_y$ = 113.3     $S_{yy}$ = 975,840

$S_{xy}$ = 1,042,971

from which $\hat\beta_1$ = .5237, $\hat\beta_0$ = 470.6, SST = $S_{yy}$ = 975,840, SSE = 429,642, and $R^2$ = .5597.
Roughly 56% of the observed variation in monthly rent can be attributed to the simple linear
regression model relationship between rent and apartment size. The remaining 44% of rent variation
is due to other apartment features, such as neighborhood, nicer appliances, or dedicated parking. Error
df is n - 2 = 77 - 2 = 75, giving $s_e^2$ = 429,642/75 = 5728.56 and $s_e$ = 75.69.


The estimated standard deviation of $\hat\beta_1$ is


$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{S_{xx}}} = \frac{75.69}{\sqrt{1{,}991{,}569}} = .0536$$


The t critical value for a confidence level of 95% is $t_{.025,75}$ = 1.992, so a 95% CI for $\beta_1$ is


$$.5237 \pm 1.992(.0536) = (.4169, .6305)$$


With a high degree of confidence, we estimate that a one square foot increase in an apartment’s size is
associated with an increase between $.4169 and $.6305 in the expected monthly rent. This applies to
the population of all two-bedroom apartments in Omaha. Multiplying by 100 gives $41.69 to $63.05
as the increase in expected rent associated with a 100 ft² size increase.
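
The arithmetic of this example can be retraced from the summary quantities alone. The sketch below is one way to do so in Python; the critical value 1.992 is copied from the text rather than computed, and SSE is obtained from the computational formula $S_{yy} - \hat\beta_1 S_{xy}$, so the last digits can differ slightly from the rounded values quoted above.

```python
import math

# Summary quantities reported in the example
n = 77
xbar, ybar = 1023.5, 1006.6
sxx, syy, sxy = 1_991_569, 975_840, 1_042_971
t_crit = 1.992                      # t_{.025,75}, taken from the text

b1 = sxy / sxx                      # estimated slope
b0 = ybar - b1 * xbar               # estimated intercept
sse = syy - b1 * sxy                # computational formula for SSE
r2 = 1 - sse / syy                  # coefficient of determination
s_e = math.sqrt(sse / (n - 2))      # residual standard deviation
s_b1 = s_e / math.sqrt(sxx)         # estimated standard error of the slope
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

print(f"slope = {b1:.4f}, intercept = {b0:.1f}, R^2 = {r2:.4f}")
print(f"s_e = {s_e:.2f}, s_b1 = {s_b1:.4f}")
print(f"95% CI for the slope: ({ci[0]:.4f}, {ci[1]:.4f})")
```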

Looking at the R output of Figure 12.15, we find the value of $s_{\hat\beta_1}$ under coefficients as the second
number in the standard error column, while the value of $s_e$ is displayed as residual standard error.
There is also an estimated standard error for the statistic $\hat\beta_0$. For all of the statistics, compare the
values in the R output with the values calculated above.

Call:
lm(formula = Rent ~ Size, data = Omaha)

Residuals:
     Min       1Q   Median       3Q      Max
-144.314  -52.306    1.658   40.635  189.477

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.62001   55.56499   8.470 1.52e-12 ***
Size          0.52369    0.05363   9.765 5.30e-15 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 75.69 on 75 degrees of freedom
Multiple R-squared: 0.5597, Adjusted R-squared: 0.5539
F-statistic: 95.35 on 1 and 75 DF, p-value: 5.302e-15


Figure 12.15 R output for the data of Example 12.9


Hypothesis-Testing Procedures

As before, the null hypothesis in a test about $\beta_1$ will be an equality statement. The null value of $\beta_1$
claimed true by the null hypothesis will be denoted by $\beta_{10}$ (read “beta one naught,” not “beta ten”).
The test statistic results from replacing $\beta_1$ in the standardized variable T by the null value $\beta_{10}$, that is,
from standardizing $\hat\beta_1$ under the assumption that $H_0$ is true. The test statistic thus has a t distribution
with $n - 2$ df when $H_0$ is true, so the type I error probability is controlled at the desired level $\alpha$ by
using an appropriate t critical value.


Null hypothesis: $H_0$: $\beta_1 = \beta_{10}$

Test statistic value: $t = \dfrac{\hat\beta_1 - \beta_{10}}{s_{\hat\beta_1}}$

Alternative Hypothesis            Rejection Region for Level $\alpha$ Test
$H_a$: $\beta_1 > \beta_{10}$     $t \ge t_{\alpha,n-2}$
$H_a$: $\beta_1 < \beta_{10}$     $t \le -t_{\alpha,n-2}$
$H_a$: $\beta_1 \ne \beta_{10}$   either $t \ge t_{\alpha/2,n-2}$ or $t \le -t_{\alpha/2,n-2}$


A P-value based on $n - 2$ df can be calculated just as was done previously for t tests in Chapters
9 and 10.
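
The rejection-region logic in the box can be written as a small helper. This is a sketch: `slope_t_test` is a hypothetical function name, and the caller is assumed to supply the appropriate critical value from a t table, since the Python standard library does not provide t quantiles.

```python
def slope_t_test(b1, s_b1, beta10, t_crit, alternative):
    """Return (t, reject) for H0: beta1 = beta10.

    t_crit is t_{alpha,n-2} for a one-sided alternative ("greater" or "less")
    and t_{alpha/2,n-2} for the two-sided alternative ("two-sided").
    """
    t = (b1 - beta10) / s_b1
    if alternative == "greater":
        reject = t >= t_crit
    elif alternative == "less":
        reject = t <= -t_crit
    elif alternative == "two-sided":
        reject = abs(t) >= t_crit
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t, reject


# Slope and standard error from the R output for the Omaha rent data,
# testing H0: beta1 = 0 against a two-sided alternative at level .05
# (critical value t_{.025,75} = 1.992 as given in the text).
t, reject = slope_t_test(0.52369, 0.05363, 0.0, 1.992, "two-sided")
print(round(t, 3), reject)
```

The computed t ratio agrees with the t value reported for Size in the R output.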


The most commonly encountered pair of hypotheses about $\beta_1$ is $H_0$: $\beta_1 = 0$ versus $H_a$: $\beta_1 \ne 0$, in
which case the test statistic value is the t ratio $t = \hat\beta_1/s_{\hat\beta_1}$. When this null hypothesis is true,
$E(Y|x) = \beta_0 + 0 \cdot x = \beta_0$, independent of x, so knowledge of x gives no information about the value of
the response variable. A test of these two hypotheses is often referred to as the model utility test in
simple linear regression.

Unless n is quite small, $H_0$ will be rejected and the utility of the model confirmed precisely when
$R^2$ is reasonably large. The simple linear regression model should not be used for further inferences,
such as estimates of mean value or predictions of future values (the topics of Section 12.4), unless the
model utility test results in rejection of $H_0$ for a suitably small $\alpha$.

Example 12.10 How is the perceived risk of an investment related to its expected return? Intuitively
it would seem as though riskier investments would be associated with higher expected returns. The
article “Affect in a Behavioral Asset-Pricing Model” (Fin. Anal. J., March/April 2008: 20-29)
reported on an experiment in which each member of a group of investors rated the risk of a
company’s stock on a 10-point scale ranging from low to high, and members of a different group rated the
future return of the stock on the same scale. This was done for a total of 210 companies, and for each
one both a risk score x and an expected return score y resulted from averaging responses from the
individual raters. The following data is from a subset of ten of these companies (listed for
convenience in increasing order of risk):


x   4.3  4.6  5.2  5.3  5.5  5.7  6.1  6.3  6.8  7.5
y   7.7  5.2  7.9  5.8  7.2  7.0  5.3  6.8  6.6  4.7


The scatterplot of the data for these ten companies in Figure 12.16 shows a weak ($R^2 \approx .18$) but also
surprising negative relationship between the two variables. Let’s carry out the model utility test at
significance level $\alpha = .05$ (the scatterplot does not bode well for the model, but stay tuned for the rest
of the story).


Figure 12.16 Scatterplot for the data in Example 12.10


The parameter of interest is $\beta_1$ = true change in expected return score associated with a one-point
increase in risk score. The null hypothesis $H_0$: $\beta_1 = 0$ will be rejected in favor of $H_a$: $\beta_1 \ne 0$ if the
observed test statistic t satisfies either $t \ge t_{\alpha/2,n-2} = t_{.025,8} = 2.306$ or $t \le -2.306$. Partial Excel
output (software not favorably regarded in the statistical community) for this example appears in
Figure 12.17. In the output, $\hat\beta_1 = -.4913$ and $s_{\hat\beta_1} = .3614$, so the test statistic is


$$t = \frac{-.4913 - 0}{.3614} \approx -1.36 \quad \text{(also on output)}$$

SUMMARY OUTPUT

              df   SS        MS       F        Significance F
Regression     1    2.0714   2.0714   1.8485   0.2110
Residual       8    8.9646   1.1206
Total          9   11.0360

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    9.2353        2.0975            4.4029   0.0023     4.3983     14.0722
Risk        -0.4913        0.3614           -1.3596   0.2110    -1.3246      0.3420


Figure 12.17 Partial Excel output for Example 12.10


Since $-1.36$ does not fall in the rejection region, $H_0$ is not rejected at the .05 level. Equivalently, the
two-sided P-value is double the area under the $t_8$ curve to the left of $-1.36$, which Excel reports as
roughly .211; since .211 > .05, again $H_0$ is not rejected.

Excel also provides a 95% confidence interval of $(-1.3246, 0.3420)$ for $\beta_1$. This is consistent with
the results of the hypothesis test: since the value 0 lies in the interval, we have no reason to reject the
claim $H_0$: $\beta_1 = 0$.
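
The Excel figures can be reproduced directly from the ten data pairs. In the sketch below the first and third y values are taken to be 7.7 and 7.9, which are the values that reproduce the intercept, slope, and total sum of squares shown in the output; the critical value 2.306 is copied from the text.

```python
import math
import statistics

x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]   # risk scores
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7.0, 5.3, 6.8, 6.6, 4.7]   # expected return scores
n = len(x)
t_crit = 2.306                     # t_{.025,8}, taken from the text

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

b1 = sxy / sxx                     # estimated slope
b0 = ybar - b1 * xbar              # estimated intercept
sse = syy - b1 * sxy               # error sum of squares
s_e = math.sqrt(sse / (n - 2))     # residual standard deviation
s_b1 = s_e / math.sqrt(sxx)        # estimated standard error of the slope

t = (b1 - 0) / s_b1                # model utility test statistic
reject = abs(t) >= t_crit          # two-sided test at level .05
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, s_b1 = {s_b1:.4f}")
print(f"t = {t:.4f}, reject H0: {reject}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```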

Is there truly no relationship, or have we committed a type II error here? With just n = 10
observations, it is quite possible we failed to detect a relationship because hypothesis tests do not have
much power when the sample size is small. In fact, the authors of the original study examined 210
companies on these same two variables, resulting in an estimated slope of $\hat\beta_1 = -0.4$ (similar to our
sample) but with an estimated standard error of roughly .0556. The resulting test statistic value is
$t = -7.2$ at 208 df, which is highly statistically significant. The authors concluded, based on their
larger sample, that risk is a useful predictor of future return, although, contrary to intuition, the
association between the two appears to be negative. Even though the relationship was statistically
significant, note that, even in the full sample, risk only accounted for $R^2$ = .185 = 18.5% of the
variation in future returns. As in the case of previous test procedures, a large sample size can result in
rejection of $H_0$ even though the data suggests that the departure from $H_0$ is of little practical
significance.


Regression and ANOVA
The splitting of the total sum of squares SST = $S_{yy} = \sum (y_i - \bar y)^2$ into a part SSE which measures
unexplained variation and a part SSR which measures variation explained by the linear relationship is
strongly reminiscent of one-way ANOVA. In fact, $H_0$: $\beta_1 = 0$ can alternatively be tested against
$H_a$: $\beta_1 \ne 0$ by constructing an ANOVA table (Table 12.1) and rejecting $H_0$ if $f \ge F_{\alpha,1,n-2}$.


Table 12.1 ANOVA table for simple linear regression


Source of variation   df        Sum of squares   Mean square           F ratio
Regression            1         SSR              MSR = SSR/1           f = MSR/MSE
Error                 n - 2     SSE              MSE = SSE/(n - 2)
Total                 n - 1     SST

The square root of the mean squared error (MSE) is $s_e$, the residual standard deviation. The F test
gives exactly the same result as the model utility t test because $t^2 = f$ and $t^2_{\alpha/2,n-2} = F_{\alpha,1,n-2}$. Virtually
all computer packages that have regression options include such an ANOVA table in the output. For
example, Figure 12.17 shows ANOVA output for the data of Example 12.10. The ANOVA table at
the top of the output has f = 1.8485 with a P-value of .211 for the model utility test. The table of
parameter estimates gives t = -1.3596, again with P = .211, and $t^2 = (-1.3596)^2 = 1.8485 = f$. Note
that this F test is only for $H_0$: $\beta_1 = 0$ versus $H_a$: $\beta_1 \ne 0$; if the alternative hypothesis is one-sided
(> or <) or if the null value $\beta_{10}$ is not 0, then the t test must be used.
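
The identity $t^2 = f$ can be checked numerically on the data of Example 12.10. The function `slr_anova` below is a hypothetical helper written for this illustration, and the first and third y values are again read as 7.7 and 7.9, consistent with the Excel output.

```python
import math
import statistics


def slr_anova(x, y):
    """ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    b1 = sxy / sxx
    ssr = b1 * sxy                   # variation explained by the line
    sse = sst - ssr                  # unexplained variation
    msr, mse = ssr / 1, sse / (n - 2)
    t = b1 / math.sqrt(mse / sxx)    # model utility t ratio
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "f": msr / mse, "t": t}


x = [4.3, 4.6, 5.2, 5.3, 5.5, 5.7, 6.1, 6.3, 6.8, 7.5]
y = [7.7, 5.2, 7.9, 5.8, 7.2, 7.0, 5.3, 6.8, 6.6, 4.7]
table = slr_anova(x, y)

for name in ("SSR", "SSE", "SST", "MSE", "f", "t"):
    print(f"{name:>3} = {table[name]:.4f}")
print("t^2 =", round(table["t"] ** 2, 4))
```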


Exercises: Section 12.3 (31-42) 


31. Reconsider the situation described in Example 12.1, in which x = firing velocity
of a 7.62-mm round and y = body armor penetration area. Suppose the simple linear
regression model is valid for x between 650 and 800 m/s, and that $\beta_1 = .25$ and
$\sigma = 10$ mm. Consider an experiment in which n = 7, and the x values at which
observations are made are $x_1 = 650$, $x_2 = 675$, $x_3 = 700$, $x_4 = 725$, $x_5 = 750$,
$x_6 = 775$, and $x_7 = 800$.

a. Calculate $\sigma_{\hat\beta_1}$, the standard deviation of $\hat\beta_1$.

b. What is the probability that the estimated slope based on such observations
will be between .15 and .35?

c. Suppose it is also possible to make a single observation at each of the n = 11
values 675, 685, 695, 705, ..., 775. If a major objective is to estimate $\beta_1$ as
precisely as possible, would the experiment with n = 11 be preferable
to the one with n = 7?


32. Exercise 16 of Section 12.2 included 
Minitab output for a regression of daugh- 
ter’s height on the midparent height. 


a. Use the output to calculate a confidence 
interval with a confidence level of 95% 
for the slope f, of the population 
regression line, and interpret the 
resulting interval. 

b. Suppose it had previously been believed that when midparent height
increased by 1 in., the associated true average change in the daughter’s height
would be at least 1 in. Does the sample data contradict this belief? State and
test the relevant hypotheses.


33. Exercise 17 of Section 12.2 gave data on x = rainfall volume and y = runoff volume
(both in m³). Use the accompanying Minitab output to decide whether there is a
useful linear relationship between rainfall and runoff, and then calculate a confidence
interval for the true average change in runoff volume associated with a 1-m³
increase in rainfall volume.


The regression equation is
runoff = -1.13 + 0.827 rainfall

Predictor   Coef       Stdev     t ratio   P
Constant    -1.128     2.368     -0.48     0.642
Rainfall     0.82697   0.03652   22.64     0.000

s = 5.240   R-sq = 97.5%   R-sq(adj) = 97.3%


34. The invasive diatom species D. Geminata has the potential to inflict substantial
ecological and economic damage in rivers. The article “Substrate Characteristics Affect
Colonization by the Bloom-Forming Diatom Didymosphenia Geminata” (Aquat.
Ecol. 2010: 33-40) described an investigation of colonization behavior. One aspect
of particular interest was whether y = colony density was related to x = rock
surface area. The article contained a scatterplot and summary of a regression
analysis. Here is representative data:


x    50    71   55   50   33   58   79
y   152  1929   48   22    2    5   35

a. Fit the simple linear regression model to
this data, and then calculate and inter- 
pret the coefficient of determination. 

b. Carry outa test of hypotheses to determine 
whether there is a useful linear relation- 
ship between density and rock area. 

c. The second observation has a very 
extreme y value (in the full data set 
consisting of 72 observations, there 
were two of these). This observation 
may have had a substantial impact on 
the fit of the model and subsequent 
conclusions. Eliminate it and redo parts 
(a) and (b). What do you conclude? 


35. How does lateral acceleration (side forces experienced in turns that are largely under
driver control) affect nausea as perceived by bus passengers? The article “Motion
Sickness in Public Road Transport: The Effect of Driver, Route, and Vehicle”
(Ergonomics 1999: 1646-1664) reported data on x = motion sickness dose (calculated
in accordance with a British standard for evaluating similar motion at sea) and
y = reported nausea (%). Relevant summary quantities are


n = 17, $\sum x_i$ = 222.1, $\sum y_i$ = 193.0,

$S_{xx}$ = 155.02, $S_{yy}$ = 783.88, $S_{xy}$ = 238.11


Values of dose in the sample ranged from 6.0 to 17.6.

a. Assuming that the simple linear regression model is valid for relating
these two variables (this is supported by the raw data), calculate and interpret
an estimate of the slope parameter that conveys information about the
precision and reliability of estimation.

b. Does it appear that there is a useful linear relationship between these two
variables? Answer the question by employing the P-value approach.

c. Would it be sensible to use the simple linear regression model as a basis for
predicting % nausea when dose = 5.0? Explain your reasoning.


36. Mist (airborne droplets or aerosols) is generated when metal-removing fluids are used in machining operations to cool and lubricate the tool and workpiece. Mist generation is a concern to OSHA, which has substantially lowered the workplace standard. The article “Variables Affecting Mist Generation from Metal Removal Fluids” (Lubricat. Engr. 2002: 10-17) gave the accompanying data on x = fluid flow velocity for a 5% soluble oil (cm/s) and y = the extent of mist droplets having diameters smaller than 10 μm (mg/m³):

x    89   177   189   354   362   442   965
y   .40   .60   .48   .66   .61   .69   .99

a. The investigators performed a simple linear regression analysis to relate the two variables. Does a scatterplot of the data support this strategy?

b. What proportion of observed variation in mist can be attributed to the simple linear regression relationship between velocity and mist?

c. The investigators were particularly interested in the impact on mist of increasing velocity from 100 to 1000 (a factor of 10 corresponding to the difference between the smallest and largest x values in the sample). When x increases in this way, is there substantial evidence that the true average increase in y is less than .6?

d. Estimate the true average change in mist associated with a 1 cm/s increase in velocity, and do so in a way that conveys information about precision and reliability.

37. Refer to the data on x = iodine value and y = cetane number given in Exercise 19.

a. Does the simple linear regression model specify a useful relationship between the two variables? Use the appropriate test procedure to obtain information about the P-value and then reach a conclusion at significance level .01.

b. Compute a 95% CI for the expected change in cetane number associated with a 10 g increase in iodine value.

38. Carry out the model utility test using the ANOVA approach for the filtration rate–moisture content data of Example 12.6. Verify that it gives a result equivalent to that of the t test.

39. Use the rules of expected value to show that \(\hat{\beta}_0\) is an unbiased estimator for \(\beta_0\) (making use of the fact that \(\hat{\beta}_1\) is unbiased for \(\beta_1\)).

40. a. Verify that \(E(\hat{\beta}_1) = \beta_1\) by using the rules of expected value from Chapter 5.

b. Use the rules of variance from Chapter 5 to verify the expression for \(V(\hat{\beta}_1)\) given in this section.

41. Verify that if each \(x_i\) is multiplied by a positive constant c and each \(y_i\) is multiplied by another positive constant d, the t statistic for testing \(H_0\colon \beta_1 = 0\) versus \(H_a\colon \beta_1 \neq 0\) is unchanged in value. [Note: The value of \(\hat{\beta}_1\) will change, which shows that the magnitude of \(\hat{\beta}_1\) is not by itself indicative of model utility.]

42. The power for the t test for \(H_0\colon \beta_1 = \beta_{10}\) can be computed in the same manner as it was computed for the t tests of Chapter 9, using the noncentral t distribution. If the alternative value of \(\beta_1\) is denoted by \(\beta_1'\), the required noncentrality parameter is

\[ \delta = \frac{\beta_1' - \beta_{10}}{\sigma/\sqrt{S_{xx}}} \]

and power is calculated based on n − 2 df. An article in the Journal of Public Health Engineering reports the results of a regression analysis based on n = 15 observations in which x = filter application temperature (°C) and y = % efficiency of BOD removal. (BOD stands for biochemical oxygen demand, and it is a measure of organic matter in sewage.) Calculated quantities include \(S_{xx} = 324.4\), \(s_e = 3.725\), and \(\hat{\beta}_1 = 1.7035\). Consider testing at significance level .01 the hypothesis \(H_0\colon \beta_1 = 1\), which states that the expected increase in % BOD removal is 1 when filter application temperature increases by 1 °C, against the alternative \(H_a\colon \beta_1 > 1\). Determine power when \(\beta_1' = 2\), \(\sigma = 4\).


12.4 Inferences for the (Mean) Response


Throughout this section we will let \(\hat{Y}\) denote the statistic

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^* \]

with observed value \(\hat{y}\), where x* denotes a specified value of the explanatory variable x. Once the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) have been calculated, \(\hat{y}\) can be regarded either as a point estimate of \(\mu_{Y|x^*}\) (the expected or true average value of Y when x = x*) or as a prediction of the Y value that will result from a single new observation made when x = x*. The point estimate or prediction by itself gives no information concerning how precisely \(\mu_{Y|x^*}\) has been estimated or Y has been predicted. This can be remedied by developing a CI for \(\mu_{Y|x^*}\) and a prediction interval (PI) for a single future Y value.

Before we obtain sample data, both \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are subject to sampling variability; that is, they are both statistics whose values will vary from sample to sample. This variability was shown in Figure 12.13 at the beginning of Section 12.3. Suppose, for example, that \(\beta_0 = 50\) and \(\beta_1 = 2\). Then a first sample of (x, y) pairs might give \(\hat{\beta}_0 = 52.35\) and \(\hat{\beta}_1 = 1.895\), a second sample might result in \(\hat{\beta}_0 = 46.52\) and \(\hat{\beta}_1 = 2.056\), and so on. It follows that \(\hat{Y}\) itself varies in value from sample to sample. If the intercept and slope of the population line are the aforementioned values 50 and 2, respectively, and x* = 10, then this statistic is trying to estimate the value \(\mu_{Y|10} = 50 + 2(10) = 70\). The estimate from a first sample might be \(\hat{y} = 52.35 + 1.895(10) = 71.30\), from a second sample might be \(\hat{y} = 46.52 + 2.056(10) = 67.08\), etc. In the same way that a confidence interval for \(\beta_1\) was based on properties of the sampling distribution of \(\hat{\beta}_1\), a confidence interval for a mean y value in regression is based on properties of the sampling distribution of the statistic \(\hat{Y}\).

Substitution of the expressions for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) into \(\hat{Y}\), followed by some algebraic manipulation, leads to the representation of \(\hat{Y}\) as a linear function of the \(Y_i\)'s:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^* = \sum_{i=1}^{n} \left[ \frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{S_{xx}} \right] Y_i = \sum_{i=1}^{n} d_i Y_i \]

where \(d_i = (1/n) + (x^* - \bar{x})(x_i - \bar{x})/S_{xx}\). The coefficients \(d_1, d_2, \ldots, d_n\) in this linear function involve the \(x_i\)'s and x*, all of which are fixed. Application of the rules of Section 5.3 to this linear function gives the following properties. (Exercise 52 requests a derivation of Property 2.)


SAMPLING DISTRIBUTION OF \(\hat{Y}\)

Let \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^*\), where x* is some fixed value of x. Then

1. The mean value of \(\hat{Y}\) is

\[ E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \beta_0 + \beta_1 x^* = \mu_{Y|x^*} \]

Thus \(\hat{\beta}_0 + \hat{\beta}_1 x^*\) is an unbiased estimator for \(\beta_0 + \beta_1 x^*\) (i.e., for \(\mu_{Y|x^*}\)).

2. The variance of \(\hat{Y}\) is

\[ V(\hat{Y}) = \sigma_{\hat{Y}}^2 = \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right] \]

and the standard deviation \(\sigma_{\hat{Y}}\) is the square root of this expression. The estimated standard deviation of \(\hat{Y}\), denoted by \(s_{\hat{Y}}\) or \(s_{\hat{\beta}_0 + \hat{\beta}_1 x^*}\), results from replacing \(\sigma\) by its estimate \(s_e\):

\[ s_{\hat{Y}} = s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = s_e \sqrt{ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} } \]

3. \(\hat{Y}\) has a normal distribution, because it is a linear function of the \(Y_i\)'s, which are normally distributed and independent.

The variance of \(\hat{Y}\) is smallest when \(x^* = \bar{x}\) and increases as x* moves away from \(\bar{x}\) in either direction. Thus the estimator of \(\mu_{Y|x^*}\) is more precise when x* is near the center of the \(x_i\)'s than when it is far from the x values where observations have been made. This implies that both the CI and PI are narrower for an x* near \(\bar{x}\) than for an x* far from \(\bar{x}\). Most statistical computer packages provide both \(\hat{y}\) and \(s_{\hat{Y}}\) for any specified x* upon request.
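The linear representation and the behavior of the estimated standard deviation can be checked numerically. The following is a minimal Python sketch, not a definitive implementation; it uses the pine tree data listed in Exercise 43 of this section, and the helper name `fit_and_se` is hypothetical, introduced only for illustration.

```python
import math

# Data from Exercise 43: x = CO2 concentration (ppm), y = tree mass (kg).
x = [408, 408, 554, 554, 680, 680, 812, 812]
y = [1.1, 1.3, 1.6, 2.5, 3.0, 4.3, 4.2, 4.7]


def fit_and_se(x, y, x_star):
    """Return (y_hat, s_y_hat, linear_form) at x_star for simple linear regression.

    linear_form is sum(d_i * y_i), which should reproduce y_hat.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx                 # least squares slope
    b0 = y_bar - b1 * x_bar          # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))   # residual standard deviation
    y_hat = b0 + b1 * x_star
    # Coefficients d_i of the linear representation of Y-hat.
    d = [1 / n + (x_star - x_bar) * (xi - x_bar) / s_xx for xi in x]
    linear_form = sum(di * yi for di, yi in zip(d, y))
    s_y_hat = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return y_hat, s_y_hat, linear_form


for x_star in (600, 750):
    y_hat, s_y_hat, linear_form = fit_and_se(x, y, x_star)
    print(x_star, y_hat, s_y_hat, abs(y_hat - linear_form) < 1e-9)
```

For these data the sample mean of x is 613.5, so x* = 750 lies farther from the center than x* = 600 and produces the larger estimated standard deviation. The computed values should agree, to within rounding, with the software values quoted in Exercise 43.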


Inferences Concerning the Mean Response
Just as inferential procedures for \(\beta_1\) were based on the t variable obtained by standardizing \(\hat{\beta}_1\), a t variable obtained by standardizing \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^*\) leads to a CI and test procedures here.


THEOREM The variable

\[ T = \frac{\hat{Y} - \mu_{Y|x^*}}{S_{\hat{Y}}} = \frac{\hat{\beta}_0 + \hat{\beta}_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat{\beta}_0 + \hat{\beta}_1 x^*}} \tag{12.6} \]

has a t distribution with n − 2 df.


As was the case for \(\beta_1\) in the previous section, a probability statement involving this standardized variable can be manipulated to yield a confidence interval for \(\mu_{Y|x^*}\).


A 100(1 − α)% CI for \(\mu_{Y|x^*}\), the mean/expected value of Y when x = x*, has endpoints

\[ \hat{y} \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{Y}} = (\hat{\beta}_0 + \hat{\beta}_1 x^*) \pm t_{\alpha/2,\,n-2} \cdot s_e \sqrt{ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} } \tag{12.7} \]


This CI is centered at the point estimate for \(\mu_{Y|x^*}\) and extends out to each side by an amount that depends on the confidence level and on the extent of variability in the estimator on which the point estimate is based.


Example 12.11 Refer back to the Omaha apartment data of Example 12.9, where the response variable was monthly rent and the predictor was square footage. Let’s now calculate a confidence interval, using a 95% confidence level, for the mean rent for all 1200 ft² two-bedroom apartments in Omaha; that is, a confidence interval for \(\mu_{Y|1200} = \beta_0 + \beta_1(1200)\). The interval is centered at

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1(1200) = 470.6 + .5237(1200) = \$1099.04 \]

The estimated standard deviation of the statistic \(\hat{Y}\) at x = x* = 1200 is

\[ s_{\hat{Y}} = s_e \sqrt{ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} } = 75.69 \sqrt{ \frac{1}{77} + \frac{(1200 - 1023.5)^2}{1{,}991{,}569} } = 12.807 \]

The 75 df t critical value for a 95% confidence level is 1.992, from which we determine the desired interval to be

\[ 1099.04 \pm 1.992(12.807) = (1073.54, 1124.57) \]

At the 95% confidence level, we estimate that the average monthly rent of all 1200 square foot, two-bedroom apartments in Omaha is between $1073.54 and $1124.57. Remember that if we recalculated this interval for sample after sample, in the long run about 95% of the calculated intervals would include \(\beta_0 + \beta_1(1200)\). We hope that this true mean value lies in the single interval that we have calculated.

For the population of all two-bedroom apartments in Omaha of size 1050 square feet, similar calculations result in \(\hat{y} = 1020.50\), \(s_{\hat{Y}} = 8.742\), and 95% CI = (1003.08, 1037.91). Notice that not only is the expected rent lower for 1050 ft² apartments than for 1200 ft² apartments (no surprise there), but the estimated standard error is also smaller. That’s because x* = 1050 is closer to the sample mean of \(\bar{x} = 1023.5\) square feet than is x* = 1200.

Figure 12.18 shows a JMP scatterplot with the LSRL and curves corresponding to the confidence limits for each different x value.

Figure 12.18 JMP scatterplot with confidence limits for the data of Example 12.11
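The arithmetic of Example 12.11 can be reproduced from summary quantities alone. The sketch below is only an illustration under stated assumptions: the function name `mean_response_ci` is hypothetical, the sample size n = 77 is inferred from the 75 df quoted in the example, the value s_e = 75.69 is taken from Example 12.12, and the t critical value 1.992 is entered by hand rather than computed.

```python
import math


def mean_response_ci(b0, b1, s_e, n, x_bar, s_xx, x_star, t_crit):
    """CI (12.7) for the mean response at x_star, from summary quantities."""
    y_hat = b0 + b1 * x_star
    s_y_hat = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    half_width = t_crit * s_y_hat
    return y_hat, s_y_hat, (y_hat - half_width, y_hat + half_width)


# Summary values quoted in Examples 12.11 and 12.12 (n = 77 inferred from 75 df).
omaha = dict(b0=470.6, b1=0.5237, s_e=75.69, n=77, x_bar=1023.5, s_xx=1_991_569)

for x_star in (1200, 1050):
    y_hat, s_y_hat, (lo, hi) = mean_response_ci(**omaha, x_star=x_star, t_crit=1.992)
    print(f"x* = {x_star}: y_hat = {y_hat:.2f}, s = {s_y_hat:.3f}, CI = ({lo:.2f}, {hi:.2f})")
```

Because the inputs are the rounded coefficients printed in the text, the endpoints can differ from the published intervals by a cent or two; the published values presumably come from unrounded software output.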


In some situations, a CI is desired not just for a single x value but for two or more x values, and we must proceed with caution in interpreting the confidence levels of our intervals. For example, in Example 12.11 two 95% CIs were constructed, one for \(\mu_{Y|1200}\) and another for \(\mu_{Y|1050}\). The joint or simultaneous confidence level (the long-run proportion of time under repeated sampling that both CIs would contain their respective parameters) is less than 95%. While it is difficult to determine the exact simultaneous confidence level, Bonferroni’s inequality (Chapter 8, Exercise 91) established that if two 100(1 − α)% CIs are computed, then the joint confidence level of the resulting pair of intervals is at least 100(1 − 2α)%. Thus, in Example 12.11 we are at least 90% confident (because α = .05 and 1 − 2α = 1 − .10 = .90) that the two statements \(1073.54 < \mu_{Y|1200} < 1124.57\) and \(1003.08 < \mu_{Y|1050} < 1037.91\) are both true.

More generally, a set of m intervals each with confidence level 100(1 − α)% is guaranteed to have simultaneous confidence of at least 100(1 − mα)%. This relationship can be reversed to achieve a desired joint confidence level: replacing α with α/m, if individual 100(1 − α/m)% CIs are constructed for each of m parameters, the resulting simultaneous confidence level is at least 100(1 − α)%. For example, if we desired intervals for both \(\mu_{Y|1200}\) and \(\mu_{Y|1050}\) with (at least) 95% joint confidence, then each individual CI should be calculated using the t critical value corresponding to confidence coefficient 1 − α/m = 1 − .05/2 = .975.
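The two directions of the Bonferroni relationship described above amount to simple arithmetic, sketched here in Python. Both function names are hypothetical helpers introduced for illustration; they encode only the bounds 100(1 − mα)% and 100(1 − α/m)% stated in the text.

```python
def joint_level_lower_bound(individual_level, m):
    """Bonferroni lower bound on the joint confidence level of m intervals."""
    alpha = 1 - individual_level
    return max(0.0, 1 - m * alpha)


def individual_level_for_joint(joint_level, m):
    """Confidence level to use for each of m intervals to guarantee joint_level."""
    alpha = 1 - joint_level
    return 1 - alpha / m


# Two 95% intervals, as in Example 12.11.
print(round(joint_level_lower_bound(0.95, 2), 4))
# Level needed for each of two intervals to reach 95% joint confidence.
print(round(individual_level_for_joint(0.95, 2), 4))
```

The bound is conservative: the true simultaneous level is at least the value returned by the first function, and it can be noticeably higher.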

Tests of hypotheses about \(\mu_{Y|x^*}\) are based on the test statistic T obtained by replacing \(\mu_{Y|x^*}\) with the null value \(\mu_0\) in the numerator of (12.6). For example, the assertion \(H_0\colon \mu_{Y|1200} = \$1100\) in Example 12.11 says that the mean rent for all 1200 ft² apartments in the population is $1100 per month. The test statistic value is then \(t = (\hat{y} - 1100)/s_{\hat{Y}}\), and the test is upper-, lower-, or two-tailed according to the inequality in \(H_a\).
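As a small worked illustration of this test statistic, the Python sketch below plugs in the values from Example 12.11. The two-sided alternative and the significance level .05 are assumptions made here for the sake of the example; the text does not specify either.

```python
y_hat = 1099.04    # point estimate at x* = 1200 (Example 12.11)
s_y_hat = 12.807   # estimated standard deviation of Y-hat at x* = 1200
mu_0 = 1100.0      # null value from H0
t_crit = 1.992     # t critical value, 75 df, two-tailed test at level .05 (assumed)

t_stat = (y_hat - mu_0) / s_y_hat
reject_h0 = abs(t_stat) >= t_crit
print(round(t_stat, 3), reject_h0)
```

The statistic is very close to zero, which is consistent with $1100 lying well inside the 95% CI computed in Example 12.11.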


A Prediction Interval for a Future Value of Y 
Analogous to the CI (12.7) for \(\mu_{Y|x^*}\), one frequently wishes to obtain an interval of plausible values for the value of Y associated with a single future observation when the explanatory variable has value x*. In Example 12.11, a CI was computed for the true mean rent of all apartments of a certain size, but an individual renter will be more interested in knowing a realistic range of rent values for a single such apartment.

A CI estimates a parameter, or population characteristic, whose value is fixed but unknown to us. In contrast, a future value of Y is not a parameter but instead a random variable; for this reason we refer to an interval of plausible values for a future Y as a prediction interval (PI) rather than a confidence interval. (Section 8.2 presented a method for constructing a one-sample t prediction interval for a single future value of a variable.)

When estimating \(\mu_{Y|x^*}\), the estimation error, \(\hat{Y} - \mu_{Y|x^*}\), is the difference between a random variable and a fixed but unknown quantity. In contrast, the prediction error is \(\hat{Y} - Y = \hat{Y} - (\beta_0 + \beta_1 x^* + \varepsilon)\), a difference between two random variables. With the additional random \(\varepsilon\) term, there is more uncertainty in prediction than in estimation. As a consequence, a PI will be wider than a CI for the same x* value. Because the future value Y is independent of the observed \(Y_i\)'s that determine \(\hat{Y}\),

\[
\begin{aligned}
V(\hat{Y} - Y) &= \text{variance of prediction error} \\
&= V(\hat{Y}) + V(Y) && \text{independence} \\
&= V(\hat{Y}) + V(\varepsilon) && \text{because } \beta_0 + \beta_1 x^* \text{ is a constant} \\
&= \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right] + \sigma^2 \\
&= \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]
\end{aligned}
\]

Furthermore, because \(E(\hat{Y}) = \beta_0 + \beta_1 x^*\) and \(E(Y) = \beta_0 + \beta_1 x^*\), the expected value of the prediction error is \(E(\hat{Y} - Y) = 0\). It can then be shown that the standardized variable

\[ T = \frac{\hat{Y} - Y}{S_e \sqrt{ 1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}} }} \]

has a t distribution with n − 2 df. Substituting this expression for T into the probability statement \(P(-t_{\alpha/2,\,n-2} < T < t_{\alpha/2,\,n-2}) = 1 - \alpha\) and manipulating to isolate Y between the two inequalities yields the following interval.


A 100(1 − α)% PI for a future Y observation to be made when x = x* has endpoints

\[ \hat{y} \pm t_{\alpha/2,\,n-2} \cdot \sqrt{ s_e^2 + s_{\hat{Y}}^2 } = (\hat{\beta}_0 + \hat{\beta}_1 x^*) \pm t_{\alpha/2,\,n-2} \cdot s_e \sqrt{ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} } \tag{12.8} \]


The interpretation of the prediction level 100(1 − α)% is identical to that of previous confidence levels: if (12.8) is used repeatedly, in the long run the resulting intervals will actually contain the observed future y values 100(1 − α)% of the time. Notice that the 1 underneath the square root symbol makes the PI (12.8) wider than the CI (12.7), although the intervals are both centered at \(\hat{y}\). Also, as \(n \to \infty\) the width of the CI approaches 0, whereas the width of the PI approaches \(2 z_{\alpha/2}\,\sigma\) (because even with perfect knowledge of \(\beta_0\) and \(\beta_1\), there will still be uncertainty in prediction).


Example 12.12 (Example 12.11 continued) Let’s calculate a 95% prediction interval for the monthly rent of a single 1200 square foot, two-bedroom apartment in Omaha. Relevant quantities from Example 12.11 are

\[ \hat{y} = 1099.04 \qquad s_{\hat{Y}} = 12.807 \qquad s_e = 75.69 \qquad t_{.025,75} = 1.992 \]

The prediction interval is then

\[ 1099.04 \pm 1.992\sqrt{75.69^2 + 12.807^2} = 1099.04 \pm 1.992(76.766) = (946.13, 1251.97) \]

Plausible values for the monthly rent of a 1200 ft² apartment are, at the 95% prediction level, between $946.13 and $1251.97. The 95% confidence interval for the mean rent of all such apartments was (1073.54, 1124.57). The prediction interval is much wider than this because of the extra 75.69² under the square root. Since apartments of the same size will vary in rent (the estimated sd of rents for apartments of any fixed size is \(s_e\) = $75.69), there is necessarily much greater uncertainty in the cost of a single apartment than in the average cost of all such apartments.
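The prediction interval of Example 12.12 can be recomputed from the first form of (12.8). This is a minimal sketch using the summary values listed in the example, with s for the mean response equal to 12.807 as reported in Example 12.11; the function name `prediction_interval` is hypothetical, and the t critical value is entered by hand.

```python
import math


def prediction_interval(y_hat, s_e, s_y_hat, t_crit):
    """PI (12.8): y_hat +/- t * sqrt(s_e^2 + s_y_hat^2)."""
    se_pred = math.sqrt(s_e ** 2 + s_y_hat ** 2)
    half_width = t_crit * se_pred
    return se_pred, (y_hat - half_width, y_hat + half_width)


se_pred, (lo, hi) = prediction_interval(
    y_hat=1099.04, s_e=75.69, s_y_hat=12.807, t_crit=1.992
)
ci_half_width = 1.992 * 12.807  # half-width of the CI from Example 12.11
print(f"se_pred = {se_pred:.3f}, PI = ({lo:.2f}, {hi:.2f})")
print(f"PI half-width / CI half-width = {(hi - lo) / 2 / ci_half_width:.1f}")
```

The ratio printed on the last line makes the comparison concrete: almost all of the prediction uncertainty comes from the variability of individual rents around the line, not from uncertainty about the line itself.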


The Bonferroni technique can be employed as in the case of confidence intervals. If a PI with prediction level 100(1 − α/m)% is calculated at each of m different x* values, the simultaneous or joint prediction level for all m intervals will be at least 100(1 − α)%.


Exercises: Section 12.4 (43-52) 


43. Global warming is a major issue, and CO₂ emissions are an important part of the discussion. The article “Effects of Atmospheric CO₂ Enrichment on Biomass Accumulation and Distribution in Eldarica Pine Trees” (J. Exp. Bot. 1994: 345-349) describes the results of growing pine trees with increasing levels of CO₂ in the air. Here are the observations with x = atmospheric concentration of CO₂ (parts per million) and y = mass in kilograms after 11 months of the experiment.

x   408   408   554   554   680   680   812   812
y   1.1   1.3   1.6   2.5   3.0   4.3   4.2   4.7

Software calculates \(s_e = .534\); \(\hat{y} = 2.723\) and \(s_{\hat{Y}} = .190\) when x = 600; and \(\hat{y} = 3.992\) and \(s_{\hat{Y}} = .256\) when x = 750.

a. Explain why \(s_{\hat{Y}}\) is larger when x = 750 than when x = 600.

b. Calculate a confidence interval with a confidence level of 95% for the true average mass of all trees grown with a CO₂ concentration of 600 parts per million.

c. Calculate a prediction interval with a prediction level of 95% for the mass of a tree grown with a CO₂ concentration of 600 parts per million.

d. If a 95% CI is calculated for the true average mass when CO₂ concentration is 750, what will be the simultaneous confidence level for both this interval and the interval calculated in part (b)?

44. Reconsider the filtration rate–moisture content data introduced in Example 12.6.

a. Compute a 90% CI for \(\beta_0 + 125\beta_1\), true average moisture content when the filtration rate is 125.

b. Predict the value of moisture content for a single experimental run in which the filtration rate is 125 using a 90% prediction level. How does this interval compare to the interval of part (a)? Why is this the case?

c. How would the intervals of parts (a) and (b) compare to a CI and PI when filtration rate is 115? Answer without actually calculating these new intervals.

d. Interpret both \(H_0\colon \beta_0 + 125\beta_1 = 80\) and \(H_a\colon \beta_0 + 125\beta_1 < 80\), and then carry out a test at significance level .01.

45. Astringency is the quality in a wine that makes the wine drinker’s mouth feel slightly rough, dry, and puckery. The paper “Analysis of Tannins in Red Wine Using Multiple Methods: Correlation with Perceived Astringency” (Amer. J. Enol. Vitic. 2006: 481-485) reported on an investigation to assess the relationship between perceived astringency and tannin concentration using various analytic methods. Here is data provided by the authors on x = tannin concentration by protein precipitation and y = perceived astringency as determined by a panel of tasters.


x 0.718 0.808 
y 0.428 0.480 


x 0.766 0.470 
y 0.326 —0.336 


x 0.674 0.858 


y 0.126 0.305 — 


x 0.907 0.638 


0.924 1.000 
0.493 0.978 


0.726 0.762 
0.765 0.190 


0.406 0.927 
0.577 0.779 


0.234 0.781 


y 1.007 —0.090 —1.132 0.538 


0.667 0.529 0.514 
0.318 0.298 —0.224 


0.666 0.562 0.378 
0.066 —0.221 —0.898 


0.311 


0.319 


0.518 


0.559 
0.198 


0.779 
0.836 


0.687 


0.707 


0.326 


0.610 


0.433 


0.648 


0.319 


1.098 


0.581 


0.862 


0.145 


0.238 
0.551 


Relevant summary quantities are as follows:

\[ \sum x_i = 19.404, \quad \sum y_i = -.549, \quad S_{xx} = 1.48193150, \quad S_{yy} = 11.82637622, \quad S_{xy} = 3.83071088 \]

a. Fit the simple linear regression model to this data. Then determine the proportion of observed variation in astringency that can be attributed to the model relationship between astringency and tannin concentration.

b. Calculate and interpret a confidence interval for the slope of the true regression line.

c. Estimate true average astringency when tannin concentration is .6, and do so in a way that conveys information about reliability and precision.

d. Predict astringency for a single wine sample whose tannin concentration is .6, and do so in a way that conveys information about reliability and precision.

e. Is there compelling evidence for concluding that true average astringency is positive when tannin concentration is .7? State and test the appropriate hypotheses.

46. The simple linear regression model provides a very good fit to the data on rainfall and runoff volume given in Exercise 17 of Section 12.2. The equation of the least squares line is \(\hat{y} = -1.128 + .82697x\), \(R^2 = .975\), and \(s_e = 5.24\).

a. Use the fact that \(s_{\hat{Y}} = 1.44\) when rainfall volume is 40 m³ to predict runoff in a way that conveys information about reliability and precision. Does the resulting interval suggest that precise information about the value of runoff for this future observation is available? Explain your reasoning.

b. Calculate a PI for runoff when rainfall is 50 using the same prediction level as in part (a). What can be said about the simultaneous prediction level for the two intervals you have calculated?

47. A simple linear regression is performed on y = salary ($1000s) and x = years of experience for actuaries. You are told that a 95% CI for the mean salary of actuaries with five years of experience, based on a sample of n = 10 observations, is (92.1, 117.7). Calculate a CI with confidence level 99% for the mean salary of actuaries with five years of experience.

48. Refer to Exercise 19 in which x = iodine value in grams and y = cetane number for a sample of 14 biofuels.

a. Software gives \(s_{\hat{Y}} = .802\) when x = 80 and \(s_{\hat{Y}} = 1.074\) when x = 120. Explain why one is much larger than the other.

b. Calculate a 95% CI for expected cetane number when the iodine value is 80 g.

c. Calculate a 95% PI for the cetane number of a single biofuel with iodine value 120 g.

49. The article “Optimization of HVAC Control to Improve Comfort and Energy Performance in a School” (Energy Engr. 2008: 6-22) gives an analysis of the electrical and gas costs for a high school in Austin, Texas after a new heating and air conditioning (HVAC) system was installed. The accompanying data on x = average outside air temperature (°F) and y = electricity consumption (kWh) for a sample of n = 20 months was read from a graph in the article.


x     48     53     56     58     58     59     59     60     68     69
y   8200   7600   8000  10000  10400  10200  11000   9500   9800   8500

x     69     70     73     75     79     80     80     84     87     88
y  11100  11800  12000  12200  11100  11400  13600  10000  14000  12500


Summary quantities include \(\bar{x} = 68.65\), \(S_{xx} = 2692.55\), \(\bar{y} = 10645\), \(S_{yy} = 60{,}089{,}500\), \(S_{xy} = 303{,}515\), \(\hat{\beta}_0 = 2906\), \(\hat{\beta}_1 = 112.7\).


a. Does the simple linear regression model specify a useful relationship between outside temperature and electricity consumption?

b. Estimate the true change in expected energy consumption associated with a 1 °F increase in outside temperature using a 95% confidence interval, and interpret the interval.

c. Calculate a 95% CI for \(\mu_{Y|70}\), the true average monthly energy consumption when temperature = 70 °F.

d. Calculate a 95% PI for a single future observation on energy consumption to be made when temperature = 70 °F.

e. Would the 95% CI and PI when temperature = 85 °F be wider or narrower than the corresponding intervals of parts (c) and (d)? Answer without actually computing the intervals.

f. Would you recommend calculating a 95% PI for an outside temperature of 95 °F? Explain.


g. Calculate simultaneous CI’s for true average monthly energy consumption when outside temperature is 60, 70, and 80 °F, respectively. Your simultaneous confidence level should be at least 97%.

50. Consider the following four intervals based on the data of the previous exercise:
• A 95% CI for energy consumption when temp = 60
• A 95% PI for energy consumption when temp = 60
• A 95% CI for energy consumption when temp = 72
• A 95% PI for energy consumption when temp = 72
Without computing any of these intervals, what can be said about their widths relative to each other?

51. Many parts of the USA are experiencing increased erosion along waterways due to increased flow from global climate change. The report “Evaluation and Assessment of Environmentally Sensitive Stream Bank Protection Measures” (Transp. Resour. Board 2016) gives the following data on x = shear stress (lb/ft²) and y = erosion depth (ft) for six experimental test trays at Colorado State University, built to re-create real stream conditions.

x   0.75   1.50   1.70   1.61   2.43   3.24
y     01     06     10     03     13     24

a. Construct a scatterplot. Does the simple linear regression model appear to be plausible?

b. Carry out a test of model utility.

c. Estimate true average erosion depth when shear stress is 1.75 lb/ft² by giving an interval of plausible values.

d. Estimate erosion depth along a single stream where water flow creates a shear stress of 1.75 lb/ft² by giving an interval of plausible values.

52. Verify that \(V(\hat{\beta}_0 + \hat{\beta}_1 x^*)\) is indeed given by the expression in the text. [Hint: \(V(\sum d_i Y_i) = \sum d_i^2\, V(Y_i)\).]


12.5 Correlation 


In many situations, the objective in studying the joint behavior of two variables is simply to see 
whether they are related, rather than to use one to predict the value of the other. In this section, we 
first develop the sample correlation coefficient r as a measure of how strongly related two variables 
x and y are in a sample, and then we relate r to the correlation coefficient ρ defined in Chapter 5.


The Sample Correlation Coefficient r

Given n pairs of observations \((x_1, y_1), \ldots, (x_n, y_n)\), it is natural to speak of x and y having a positive relationship if large x’s are paired with large y’s and small x’s with small y’s. Similarly, if large x’s are paired with small y’s and vice versa, then a negative relationship between the variables is implied. Consider standardizing each x value in the sample, i.e., replacing each \(x_i\) with \((x_i - \bar{x})/s_x\). Now do the same thing with the \(y_i\)’s to obtain the standardized y values \((y_i - \bar{y})/s_y\). Our proposed measure of the direction and strength of the relationship between the x’s and y’s involves the sum of the products of these standardized values.

DEFINITION The sample correlation coefficient for the n pairs \((x_1, y_1), \ldots, (x_n, y_n)\) is

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{S_{xy}}{(n-1)\, s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \tag{12.9} \]


The denominator of Expression (12.9) is clearly positive, so the sign of r (+ or −) is determined by the numerator \(S_{xy}\). If the relationship between x and y is strongly positive, an \(x_i\) above the mean \(\bar{x}\) will tend to be paired with a \(y_i\) above the mean \(\bar{y}\), so that \((x_i - \bar{x})(y_i - \bar{y}) > 0\), and this same product will also be positive whenever both \(x_i\) and \(y_i\) are below their respective means (a negative times a negative equals a positive). Thus a positive relationship implies that \(S_{xy}\) will be positive. An analogous argument shows that when the relationship is negative, \(S_{xy}\) will be negative, since most of the products \((x_i - \bar{x})(y_i - \bar{y})\) will be negative. This is illustrated in Figure 12.19.

Figure 12.19 (a) Scatterplot with r and \(S_{xy}\) positive; (b) scatterplot with r and \(S_{xy}\) negative [+ means \((x_i - \bar{x})(y_i - \bar{y}) > 0\), and − means \((x_i - \bar{x})(y_i - \bar{y}) < 0\)]


The most important properties of r are as listed below. 


PROPERTIES OF r

1. The value of r does not depend on which of the two variables is labeled x and which is labeled y.

2. The value of r is independent of the units in which x and y are measured. In particular, r itself is unitless.

3. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model; in symbols, \(r^2 = R^2\).

4. \(-1 \le r \le 1\).

5. \(r = \pm 1\) if and only if all \((x_i, y_i)\) pairs lie on a straight line.


Proof Property 1 should be evident. Property 2 is a direct result of standardizing the two variables; Exercise 64 asks for a formal verification. To prove Property 3, recall that \(R^2\) can be expressed as the ratio SSR/SST, where \(\text{SSR} = \sum (\hat{y}_i - \bar{y})^2\) and \(\text{SST} = S_{yy} = \sum (y_i - \bar{y})^2\). It is easily shown (see Exercise 24(b)) that \(\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x})\), and therefore

\[ \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \sum (x_i - \bar{x})^2 = \left( \frac{S_{xy}}{S_{xx}} \right)^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}} \]

\[ R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{S_{xy}^2 / S_{xx}}{S_{yy}} = \left( \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \right)^2 = r^2 \]

Because \(r^2 = R^2\) = SSR/SST = (SST − SSE)/SST, and the numerator cannot be bigger than the denominator, r must be between −1 and 1. Furthermore, because the ratio can be 1 if and only if SSE = 0, we conclude that \(r^2 = 1\) (i.e., \(r = \pm 1\)) if and only if all the points fall on a straight line.


Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, \(s_e\), etc.) depend on which of the two variables is treated as the response variable. However, Property 3 shows that the proportion of variation in the response variable explained by fitting the simple linear regression model does not depend on which variable plays this role.

Property 2 is equivalent to saying that r is unchanged if each \(x_i\) is replaced by \(cx_i\) and if each \(y_i\) is replaced by \(dy_i\) (where c and d are positive, giving a change in the scale of measurement), as well as if each \(x_i\) is replaced by \(x_i - a\) and \(y_i\) by \(y_i - b\) (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.

Property 4 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with r = −1. According to Property 5, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude. Thus, r measures the degree of linear relationship among variables. A value of r near 0 is not necessarily evidence of a lack of a strong relationship, but only the absence of a linear relation, so that such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
Figure 12.20 Data plots for different values of r
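The caution about values of r near 0 can be illustrated with a short Python sketch. The data below are hypothetical and chosen only for illustration (they do not come from the text): one response is an exact quadratic function of x, the other an exact linear function.

```python
import math


def sample_correlation(x, y):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustrative data, not from the text.
x = [-3, -2, -1, 0, 1, 2, 3]
y_quadratic = [xi ** 2 for xi in x]     # deterministic but nonlinear in x
y_linear = [2 * xi + 5 for xi in x]     # exactly linear in x

r_quadratic = sample_correlation(x, y_quadratic)
r_linear = sample_correlation(x, y_linear)
print(r_quadratic, r_linear)
```

In the quadratic case y is completely determined by x, yet the positive and negative products cancel and r is zero; only the exactly linear case attains r = 1, in line with Property 5.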
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Example 12.13 The article “A Cross-National Relationship Between Sugar Consumption and Major 
Depression?” (Depression and Anxiety 2002: 118-120) reported the following data on x = daily 
sugar consumption (calories per capita) and y = annual rate of major depression (cases per 100 
people) for a sample of six countries. 


Country Sugar consumption Depression rate 
USA 300 3.0 
Canada 390 5.2 
France 350 4.4 
Germany 375 5.0 
New Zealand 480 5.7 
South Korea 150 2.3 


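With n = 6, x̄ = 340.8, s_x = 110.6, ȳ = 4.267, and s_y = 1.338,

\[
r = \frac{1}{6-1}\left[\left(\frac{300 - 340.8}{110.6}\right)\left(\frac{3.0 - 4.267}{1.338}\right) + \cdots + \left(\frac{150 - 340.8}{110.6}\right)\left(\frac{2.3 - 4.267}{1.338}\right)\right] = .944
\]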


Equivalently, S_xx = 61,120.83, S_yy = 8.953, S_xy = 698.667, and r = S_xy/√(S_xx · S_yy) = .944. Since the
correlation coefficient is positive and close to 1, the data indicates a strong, positive relationship
between sugar consumption and depression rate, at least for these six countries. A scatterplot of this
data (not shown) also supports the notion of a strong, positive association. Note that if sugar
consumption was converted into grams per capita (a gram of sugar has about 4 calories, so x_i' = x_i/4), the
summary values for the x data would change but r would remain .944.

Does this study show that increased sugar consumption causes depression? Would forcing people 
in these countries to eat less sugar reduce the depression rate? Not necessarily: the high r value 
establishes a strong association between the two variables, but (as discussed in earlier chapters) 
association does not imply causation. Other factors not explored by the investigators may explain 
why nations with greater sugar consumption report higher depression rates. It should also be noted 
that aggregating data—here, looking at data on the national rather than individual level—tends to 
inflate the correlation coefficient by averaging out individual variation that would weaken the 
apparent relationship between the two variables.
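The computation in Example 12.13 can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `sample_correlation` is our own choice for illustration and is not taken from any statistical package. It also confirms numerically that rescaling x from calories to grams leaves r unchanged.

```python
from math import sqrt

def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired observations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)

# Data from Example 12.13 (six countries)
sugar = [300, 390, 350, 375, 480, 150]          # calories per capita per day
depression = [3.0, 5.2, 4.4, 5.0, 5.7, 2.3]     # cases per 100 people per year

r = sample_correlation(sugar, depression)

# Change of scale: roughly 4 calories per gram of sugar
sugar_grams = [xi / 4 for xi in sugar]
r_grams = sample_correlation(sugar_grams, depression)

print(round(r, 3), round(r_grams, 3))  # both round to 0.944
```

Because r is built from standardized deviations, any positive rescaling of either variable cancels in the ratio, which is what the second call illustrates.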


Correlation and the Regression Effect 
The correlation coefficient can be used to obtain an alternative expression for the equation of the least 
squares regression line: 


\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = \bar{y} + r \cdot \frac{s_y}{s_x}(x - \bar{x})
\]

(Exercise 66 requests a derivation of this result.) This expression for the regression line can be
interpreted as follows. Suppose r > 0. For an x that lies one standard deviation (s_x units) above the
mean x̄ of the x_i’s, the predicted y value is ȳ + r·s_y, which is r standard deviations above the mean on the
y scale. If r is negative, the LSRL predicts that the y value when x is one sd above average will be
|r| sd’s below average. Critically, since the magnitude of r is typically strictly less than 1, our model
predicts that, on a standardized scale, the response variable will be closer to its mean than the
explanatory variable is to its mean.
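To make this interpretation concrete, the sketch below rebuilds the least squares line from r, the two sample means, and the two sample standard deviations, reusing the sugar and depression data of Example 12.13 purely as an illustration. The names `predict`, `slope_ls`, and `z_pred` are ours. The check at the end shows that an x value one standard deviation above x̄ yields a prediction exactly r standard deviations above ȳ.

```python
from statistics import mean, stdev

# Data from Example 12.13, reused here only for illustration
sugar = [300.0, 390.0, 350.0, 375.0, 480.0, 150.0]
depression = [3.0, 5.2, 4.4, 5.0, 5.7, 2.3]

n = len(sugar)
x_bar, y_bar = mean(sugar), mean(depression)
s_x, s_y = stdev(sugar), stdev(depression)      # sample sd's (divisor n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sugar, depression))
r = s_xy / ((n - 1) * s_x * s_y)

def predict(x):
    """Least squares prediction written as y_bar + r * (s_y / s_x) * (x - x_bar)."""
    return y_bar + r * (s_y / s_x) * (x - x_bar)

# The usual least squares slope S_xy / S_xx, for comparison
slope_ls = s_xy / sum((x - x_bar) ** 2 for x in sugar)

# Standardized prediction when x is one sd above its mean
z_pred = (predict(x_bar + s_x) - y_bar) / s_y
print(round(r, 3), round(z_pred, 3))
```

The two printed values agree, and since |r| < 1 the standardized prediction is closer to zero than the standardized x value of 1.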

The term regression analysis was first used by Francis Galton in the late nineteenth century in 
connection with his work on the relationship between father’s height x and son’s height y. After 
collecting a number of pairs (x;, y;), Galton used the principle of least squares to obtain the equation of 
the LSRL with the objective of using it to predict son’s height from father’s height. In using the 
derived line, Galton found that if a father was above average in height, his son was also expected to 
be above average in height, but not by as much as the father. Similarly, the son of a shorter-than- 
average father was expected to be shorter than average, but not by as much as the father. Thus the 
predicted height of a son was “pulled back in” toward the mean; because regression can be defined as 
moving backward, Galton adopted the terminology regression line. This phenomenon of being pulled 
back in toward the mean has been observed in many other situations (e.g., a player’s batting averages 
from year to year in baseball) and is called the regression effect or regression to the mean. See also 
Section 5.5 for a discussion of this topic in the context of the bivariate normal distribution. 

Because of the regression effect, care must be exercised in experiments that involve selecting 
individuals based on below-average scores. For example, if students are selected because of below- 
average performance on a test and they are then given special instruction, the regression effect 
predicts improvement even if the instruction is useless. A similar warning applies in studies of 
underperforming businesses or hospital patients. 
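The warning about selecting on below-average scores can be illustrated by simulation. The sketch below is hypothetical throughout: the score model (a stable ability plus independent noise on each test), the means, and the standard deviations are all assumptions chosen for illustration, and no instruction effect is built in. Students who score below average on the first test nevertheless score higher, on average, on the second.

```python
import random

def simulate_retest(n_students=10_000, seed=1):
    """Return (mean test 1, mean test 2, cutoff) for students below average on test 1."""
    rng = random.Random(seed)
    # Hypothetical model: observed score = stable ability + independent noise
    ability = [rng.gauss(70, 8) for _ in range(n_students)]
    test1 = [a + rng.gauss(0, 6) for a in ability]
    test2 = [a + rng.gauss(0, 6) for a in ability]   # no instruction effect at all

    cutoff = sum(test1) / n_students
    selected = [i for i in range(n_students) if test1[i] < cutoff]
    mean_first = sum(test1[i] for i in selected) / len(selected)
    mean_second = sum(test2[i] for i in selected) / len(selected)
    return mean_first, mean_second, cutoff

mean_first, mean_second, cutoff = simulate_retest()
print(f"selected group: test 1 mean = {mean_first:.1f}, test 2 mean = {mean_second:.1f}")
```

The selected group improves but remains below the overall average, which is the regression effect: the second score is pulled back toward the mean without reaching it.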


The Population Correlation Coefficient ρ and Inferences About Correlation

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. 
We can think of the pairs (x_i, y_i) as having been drawn from a bivariate population of pairs, with
(X_i, Y_i) having some joint probability distribution f(x, y). In Chapter 5, we defined the correlation
coefficient ρ(X, Y) by

\[
\rho = \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
\]

If we think of f(x, y) as describing the distribution of pairs of values within the entire population, ρ
becomes a measure of how strongly related x and y are in that population. Properties of ρ analogous to
those for r were given in Chapter 5.

The population correlation coefficient ρ is a parameter or population characteristic, just as μ_X, μ_Y,
σ_X, and σ_Y are, and we can use the sample correlation coefficient to make various inferences about ρ.
In particular, r is a point estimate for ρ, and the corresponding estimator is

\[
\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}
\]


Many of the intervals and test procedures presented in Chapters 8-10 were based on an assumption of
population normality. To test hypotheses about ρ, we must make an analogous assumption about the
distribution of pairs of (x, y) values in the population. We are now assuming that both X and Y are
random, with joint distribution given by the bivariate normal pdf introduced in Section 5.5.

If X = x, recall from Section 5.5 that the conditional distribution of Y is normal with mean
μ_{Y|x} = μ_2 + (ρσ_2/σ_1)(x - μ_1) and variance (1 - ρ²)σ_2². This is exactly the model used in simple
linear regression with β_0 = μ_2 - ρμ_1σ_2/σ_1, β_1 = ρσ_2/σ_1, and σ² = (1 - ρ²)σ_2² independent of
x. The implication is that if the observed pairs (x_i, y_i) are actually drawn from a bivariate normal
distribution, then the simple linear regression model is an appropriate way of studying the behavior
of Y for fixed x. If ρ = 0, then μ_{Y|x} = μ_2 independent of x; in fact, when ρ = 0 the joint pdf f(x, y) can
be factored into a part involving x only and a part involving y only, which implies that X and Y are
independent random variables.
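The correspondence between the bivariate normal parameters and the regression parameters is easy to mechanize. In the sketch below the function name `regression_parameters` and the numerical inputs are hypothetical, chosen only to show the arithmetic; the formulas are the ones stated in the preceding paragraph.

```python
def regression_parameters(mu1, mu2, sigma1, sigma2, rho):
    """Map bivariate normal parameters to (beta0, beta1, sigma^2) of the
    simple linear regression of Y on x."""
    beta1 = rho * sigma2 / sigma1
    beta0 = mu2 - rho * mu1 * sigma2 / sigma1
    sigma_sq = (1 - rho ** 2) * sigma2 ** 2
    return beta0, beta1, sigma_sq

# Hypothetical parameter values, for illustration only
beta0, beta1, sigma_sq = regression_parameters(mu1=68.0, mu2=69.0,
                                               sigma1=3.0, sigma2=4.0, rho=0.5)
print(beta0, beta1, sigma_sq)

# With rho = 0 the slope vanishes and the conditional mean is mu2 for every x
beta0_indep, beta1_indep, sigma_sq_indep = regression_parameters(68.0, 69.0, 3.0, 4.0, 0.0)
```

Note that the regression line always passes through (μ_1, μ_2), and that the conditional variance shrinks toward 0 as |ρ| approaches 1.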


Example 12.14 As discussed in Section 5.5, contours of the bivariate normal distribution are 
elliptical, and this suggests that a scatterplot of observed (x, y) pairs from such a joint distribution 
should have a roughly elliptical shape. The article “Methods of Estimation of Visceral Fat: Advan- 
tages of Ultrasonography” (Obesity Res. 2003: 1488-1494) includes the scatterplot in Figure 12.21 
for x = visceral fat (cm²) measured by ultrasound (US) versus y = visceral fat by computerized
tomography (CT) for a sample of n = 100 obese women. CT is considered the most accurate tech- 
nique for body fat measurement but is costly, time-consuming, and involves exposure to ionizing 
radiation; the US method is noninvasive and less expensive. 


Figure 12.21 Scatterplot for Example 12.14 (vertical axis: fat by CT; horizontal axis: fat by US)


The pattern in the scatterplot in Figure 12.21 seems consistent with an assumption of bivariate 
normality. If we let ρ denote the true population correlation coefficient between CT and US
measurements, then a point estimate of ρ is ρ̂ = r = .71, a value given in the article. Of course we would
want fat measurements from the two methods to be very highly correlated before regarding one as an
adequate substitute for the other. By that standard, r = .71 is not all that impressive, but the
investigators reported that a test of H_0: ρ = 0 (to be introduced shortly) gives P-value < .001.


Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses 
about ρ and to construct a CI. There is no completely satisfactory way to check the plausibility of the
bivariate normality assumption. A partial check involves constructing two separate normal probability
plots, one for the sample x_i’s and another for the sample y_i’s, since bivariate normality implies that the
marginal distributions of both X and Y are normal. If either probability plot deviates substantially from
a straight-line pattern, the following inferential procedures should not be used when the sample size
n is small. Also, as in Example 12.14, the scatterplot should show a roughly elliptical shape.

TESTING FOR THE ABSENCE OF CORRELATION

When H_0: ρ = 0 is true, the test statistic

\[
T = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}
\]

has a t distribution with n - 2 df (see Exercise 63).

Alternative Hypothesis      Rejection Region for Level α Test
H_a: ρ > 0                  t ≥ t_{α,n-2}
H_a: ρ < 0                  t ≤ -t_{α,n-2}
H_a: ρ ≠ 0                  either t ≥ t_{α/2,n-2} or t ≤ -t_{α/2,n-2}

A P-value based on n - 2 df can be calculated as described previously.


Example 12.15 Neurotoxic effects of manganese are well known and are usually caused by high 
occupational exposure over long periods of time. In the fields of occupational hygiene and envi- 
ronmental hygiene, the relationship between lipid peroxidation, which is responsible for deterioration 
of foods and damage to live tissue, and occupational exposure had not been previously reported. The 
article “Lipid Peroxidation in Workers Exposed to Manganese” (Scand. J. Work Environ. Health 
1996: 381-386) gave data on x = manganese concentration in blood (ppb) and y = concentration 
(μmol/L) of malondialdehyde, which is a stable product of lipid peroxidation, both for a sample of 22
workers exposed to manganese and for a control sample of 45 individuals. The value of r for the 
control sample was .29, from which

\[
t = \frac{(.29)\sqrt{45 - 2}}{\sqrt{1 - (.29)^2}} \approx 2.0
\]

The corresponding P-value for a two-tailed t test based on 43 df is roughly .052 (the cited article
reported only that the P-value > .05). We would not want to reject the assertion that ρ = 0 at either
significance level .01 or .05. For the sample of exposed workers, r = .83 and t = 6.7, clear evidence
that there is a positive relationship in the entire population of exposed workers from which the sample
was selected. Although in general correlation does not necessarily imply causation, it is plausible here
that higher levels of manganese cause higher levels of peroxidation.
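The test statistic and its P-value for Example 12.15 can be reproduced without specialized software. The sketch below is one possible approach, not the method used by any particular package: the helper names are ours, and the P-value is approximated by applying Simpson's rule to the t density rather than by calling a library cdf. Working from the unrounded statistic (about 1.99 rather than 2.0) gives a two-tailed P-value slightly above .05, in line with the value quoted above.

```python
from math import sqrt, pi, lgamma, exp

def correlation_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for testing H0: rho = 0."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi(df))
    return exp(log_c) * (1 + t * t / df) ** (-(df + 1) / 2)

def log_df_pi(df):
    """Natural log of df * pi, split out to keep t_pdf readable."""
    from math import log
    return log(df * pi)

def two_tailed_p_value(t, df, steps=2000):
    """Approximate P(|T| >= |t|) by Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3          # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * area)

# Control sample: r = .29, n = 45
t_control = correlation_t_statistic(0.29, 45)
p_control = two_tailed_p_value(t_control, df=43)

# Exposed workers: r = .83, n = 22
t_exposed = correlation_t_statistic(0.83, 22)

print(round(t_control, 2), round(p_control, 3), round(t_exposed, 1))
```

For routine work a statistical package's t cdf would normally replace the numerical integration; the integration is used here only to keep the sketch self-contained.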


Because ρ measures the extent to which there is a linear relationship between the two variables in
the population, the null hypothesis H_0: ρ = 0 states that there is no such population relationship. In
Section 12.3, we used the t ratio β̂_1/s_{β̂_1} to test for a linear relationship between the two variables in
the context of regression analysis. It turns out that the two test procedures are completely equivalent
because r√(n - 2)/√(1 - r²) = β̂_1/s_{β̂_1} (Exercise 63).


Other Inferences Concerning p 
The procedure for testing H_0: ρ = ρ_0 when ρ_0 ≠ 0 is not equivalent to any procedure from regression
analysis. The test statistic is based on a transformation of R called the Fisher transformation.

PROPOSITION

When (X_1, Y_1), ..., (X_n, Y_n) is a sample from a bivariate normal distribution, the rv

\[
V = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right)
\]

has approximately a normal distribution with mean and variance

\[
\mu_V = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \qquad \sigma_V^2 = \frac{1}{n-3}
\]

The rationale for the transformation is to obtain a function of R that has a variance independent of ρ;
this would not be the case with R itself. The approximation will not be valid if n is quite small.

The test statistic for testing H_0: ρ = ρ_0 is

\[
Z = \frac{V - \frac{1}{2}\ln\left[(1+\rho_0)/(1-\rho_0)\right]}{1/\sqrt{n-3}}
\]

Alternative Hypothesis      Rejection Region for Level α Test
H_a: ρ > ρ_0                z ≥ z_α
H_a: ρ < ρ_0                z ≤ -z_α
H_a: ρ ≠ ρ_0                either z ≥ z_{α/2} or z ≤ -z_{α/2}

A P-value can be calculated in the same manner as for previous z tests.


Example 12.16 As far back as Leonardo da Vinci, it was known that height and wingspan (mea- 
sured fingertip to fingertip between outstretched hands) are closely related. For these measurements 
(in inches) from 16 students in a statistics class, notice how close the two values are.

Student     1     2     3     4     5     6     7     8
Height     63.0  63.0  65.0  64.0  68.0  69.0  71.0  68.0
Wingspan   62.0  62.0  64.0  64.5  67.0  69.0  70.0  72.0

Student     9     10    11    12    13    14    15    16
Height     68.0  72.0  73.0  73.5  70.0  70.0  72.0  74.0
Wingspan   70.0  72.0  73.0  75.0  71.0  70.0  76.0  76.5

The scatterplot in Figure 12.22 shows an approximately linear shape, and the point cloud is
roughly elliptical. Also, the normal plots for the individual variables are roughly linear, so the
bivariate normal distribution can reasonably be assumed.

Figure 12.22 Wingspan plotted against height


The correlation is computed to be .9422. Can it be concluded that the true correlation between
wingspan and height exceeds .8? To carry out a test of H_0: ρ = .8 versus H_a: ρ > .8, we Fisher
transform .9422 and .8:

\[
v = \frac{1}{2}\ln\left(\frac{1 + .9422}{1 - .9422}\right) = 1.757, \qquad
\mu_V = \frac{1}{2}\ln\left(\frac{1 + .8}{1 - .8}\right) = 1.099
\]

The z test statistic is z = (1.757 - 1.099)/(1/√(16 - 3)) = 2.37. Since 2.37 ≥ z_{.01} = 2.33, at level
.01 we can reject H_0: ρ = .8 in favor of H_a: ρ > .8 and conclude that wingspan is highly correlated
with height.
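The Fisher-transformation test of Example 12.16 takes only a few lines, because ½ ln[(1 + r)/(1 - r)] is the inverse hyperbolic tangent of r. The sketch below uses Python's standard library; the function name `fisher_z_statistic` is our own. The upper-tailed P-value comes from the standard normal cdf.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_statistic(r, rho0, n):
    """z statistic for H0: rho = rho0 based on the Fisher transformation."""
    v = atanh(r)          # 0.5 * ln((1 + r) / (1 - r))
    mu_v = atanh(rho0)    # 0.5 * ln((1 + rho0) / (1 - rho0))
    return (v - mu_v) / (1 / sqrt(n - 3))

# Example 12.16: r = .9422, n = 16, H0: rho = .8 versus Ha: rho > .8
z = fisher_z_statistic(0.9422, 0.8, 16)
p_value = 1 - NormalDist().cdf(z)     # upper-tailed test

print(round(z, 2), round(p_value, 4))
```

Carrying full precision gives z ≈ 2.375 rather than the 2.37 obtained from the rounded values 1.757 and 1.099; the conclusion at level .01 is the same.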


To obtain a CI for ρ, we first derive an interval for μ_V = ½ ln[(1 + ρ)/(1 - ρ)]. Standardizing V,
writing a probability statement, and manipulating the resulting inequalities yields

\[
v \pm \frac{z_{\alpha/2}}{\sqrt{n-3}} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \pm \frac{z_{\alpha/2}}{\sqrt{n-3}} \tag{12.10}
\]

as the endpoints of a 100(1 - α)% interval for μ_V. This interval can then be manipulated to yield a CI
for ρ.

A 100(1 - α)% confidence interval for ρ is

\[
\left(\frac{e^{2c_1} - 1}{e^{2c_1} + 1},\ \frac{e^{2c_2} - 1}{e^{2c_2} + 1}\right)
\]

where c_1 and c_2 are the left and right endpoints, respectively, in Expression (12.10).


Example 12.17 (Example 12.16 continued) The sample correlation coefficient between wingspan 
and height was r = .9422, giving v = 1.757. With n = 16, a 95% confidence interval for μ_V is

\[
1.757 \pm \frac{1.96}{\sqrt{16 - 3}} = (1.213,\ 2.301) = (c_1, c_2)
\]

The 95% interval for ρ is

\[
\left(\frac{e^{2(1.213)} - 1}{e^{2(1.213)} + 1},\ \frac{e^{2(2.301)} - 1}{e^{2(2.301)} + 1}\right) = (.838,\ .980)
\]

Notice that this interval excludes .8, and that our hypothesis test in Example 12.16 would have
rejected H_0: ρ = .8 in favor of the alternative H_a: ρ > .8 at the .025 level.
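The interval of Example 12.17 can be computed in the same way. Since (e^{2c} - 1)/(e^{2c} + 1) is the hyperbolic tangent of c, the back-transformation is a single call to `tanh`. The function name `correlation_confidence_interval` below is ours, and the sketch assumes bivariate normality, as in the text.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_confidence_interval(r, n, level=0.95):
    """Fisher-transformation CI for rho, assuming a bivariate normal sample."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    v = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    c1, c2 = v - half_width, v + half_width     # interval for mu_V
    return tanh(c1), tanh(c2)                   # back-transform to the rho scale

lower, upper = correlation_confidence_interval(0.9422, 16)
print(round(lower, 3), round(upper, 3))
```

The interval is not symmetric about r = .9422: the transformation stretches the scale near ±1, so the lower endpoint lies farther from r than the upper one.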


Absent the assumption of bivariate normality, a bootstrap procedure can be used to obtain a CI for
ρ or test hypotheses.
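As one illustration of such a procedure, the sketch below computes a percentile bootstrap interval for ρ from the height and wingspan data of Example 12.16. This is only one of several bootstrap variants, and the choices made here (percentile method, 2000 resamples, a fixed seed, the function names) are our assumptions rather than a prescription from the text. Pairs are resampled together so that the dependence between x and y is preserved.

```python
import random
from math import sqrt

def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)

def bootstrap_correlation_ci(x, y, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap CI for rho, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined when a resample has no variation
        estimates.append(sample_correlation(xs, ys))
    estimates.sort()
    alpha = 1 - level
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Data from Example 12.16
height = [63.0, 63.0, 65.0, 64.0, 68.0, 69.0, 71.0, 68.0,
          68.0, 72.0, 73.0, 73.5, 70.0, 70.0, 72.0, 74.0]
wingspan = [62.0, 62.0, 64.0, 64.5, 67.0, 69.0, 70.0, 72.0,
            70.0, 72.0, 73.0, 75.0, 71.0, 70.0, 76.0, 76.5]

r_obs = sample_correlation(height, wingspan)
boot_lo, boot_hi = bootstrap_correlation_ci(height, wingspan)
print(round(r_obs, 4), round(boot_lo, 3), round(boot_hi, 3))
```

The endpoints depend on the seed and the number of resamples, so they will differ somewhat from run to run and from the Fisher-transformation interval.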

In Chapter 5, we cautioned that a large value of the correlation coefficient (near 1 or -1) implies
only association and not causation. This applies to both ρ and r. It is easy to find strong but weird
correlations in which neither variable is causally related to the other. For example, since Prohibition
ended in the 1930s, beer consumption and church attendance have correlated very highly. Of course,
the reason is that both variables have increased in accord with population growth.


Exercises: Section 12.5 (53-66) 


53. The article “Behavioural Effects of Mobile Telephone Use During Simulated Driving”
    (Ergonomics 1995: 2536-2562) reported that for a sample of 20 experimental subjects, the
    sample correlation coefficient for x = age and y = time since the subject had acquired a
    driving license (yr) was .97. Why do you think the value of r is so close to 1? (The
    article’s authors gave an explanation.)

54. The Turbine Oil Oxidation Test (TOST) and the Rotating Bomb Oxidation Test (RBOT) are two
    different procedures for evaluating the oxidation stability of steam turbine oils. The
    article “Dependence of Oxidation Stability of Steam Turbine Oil on Base Oil Composition”
    (J. Soc. Tribologists Lubricat. Engrs., Oct. 1997: 19-24) reported the accompanying
    observations on x = TOST time (hr) and y = RBOT time (min) for 12 oil specimens.

    TOST   4200   3600   3750   3675   4050   2770
    RBOT    370    340    375    310    350    200

    TOST   4870   4500   3450   2700   3750   3300
    RBOT    400    375    285    225    345    285

    a. Calculate and interpret the value of the sample correlation coefficient (as did the
       article’s authors).
    b. How would the value of r be affected if we had let x = RBOT time and y = TOST time?
    c. How would the value of r be affected if RBOT time was expressed in hours?
    d. Construct a scatterplot and normal probability plots and comment.
    e. Carry out a test of hypotheses to decide whether RBOT time and TOST time are linearly
       related.


55. The authors of the paper “Objective Effects of a Six Months’ Endurance and Strength
    Training Program in Outpatients with Congestive Heart Failure” (Med. Sci. Sports Exerc.
    1999: 1102-1107) presented a correlation analysis to investigate the relationship between
    maximal lactate level x and muscular endurance y. The accompanying data was read from a
    plot in the paper.

    x    400    750    770    800    850   1025   1200
    y   3.80   4.00   4.90   5.20   4.00   3.50   6.30

    x   1250   1300   1400   1475   1480   1505   2200
    y   6.88   7.55   4.95   7.80   4.45   6.60   8.90

    S_xx = 2,628,930.357,  S_yy = 36.9839,  S_xy = 7377.704

    A scatterplot shows a linear pattern.
    a. Test to see whether there is a positive correlation between maximal lactate level and
       muscular endurance in the population from which this data was selected.
    b. If a regression analysis was to be carried out to predict endurance from lactate level,
       what proportion of observed variation in endurance could be attributed to the
       approximate linear relationship? Answer the analogous question if regression is used to
       predict lactate level from endurance, and answer both questions without doing any
       regression calculations.

56. Torsion during hip external rotation and extension may explain why acetabular labral
    tears occur in professional athletes. The article “Hip Rotational Velocities During the
    Full Golf Swing” (J. Sport Sci. Med. 2009: 296-299) reported on an investigation in which
    lead hip internal peak rotational velocity (x) and trailing hip peak external rotational
    velocity (y) were determined for a sample of 15 golfers. Data provided by the article’s
    authors was used to calculate the following summary quantities:

    S_xx = 64,732.83,  S_yy = 130,566.96,  S_xy = 44,185.87

    Separate normal probability plots showed very substantial linear patterns.
    a. Calculate a point estimate for the population correlation coefficient.
    b. If the simple linear regression model was fit to the data, what proportion of variation
       in external velocity could be attributed to the model relationship? What would happen
       to this proportion if the roles of x and y were reversed? Explain.
    c. Carry out a test at significance level .01 to decide whether there is a linear
       relationship between the two velocities in the sampled population; your conclusion
       should be based on a P-value.
    d. Would the conclusion of (c) have changed if you had tested appropriate hypotheses to
       decide whether there is a positive linear association in the population? What if a
       significance level of .05 rather than .01 had been used?


57. Hydrogen content is conjectured to be an important factor in porosity of aluminum alloy
    castings. The article “The Reduced Pressure Test as a Measuring Tool in the Evaluation of
    Porosity/Hydrogen Content in Al-7 Wt Pct Si-10 Vol Pct SiC(p) Metal Matrix Composite”
    (Metallurg. Trans. 1993: 1857-1868) gives the accompanying data on x = content and
    y = gas porosity for one particular measurement technique.

    x   .18   .20   .21   .21   .21   .22   .23
    y   .46   .70   .41   .45   .55   .44   .24

    x   .23   .24   .24   .25   .28   .30   .37
    y   .47   .22   .80   .88   .70   .72   .75

    Minitab gives the following output in response to a correlation command:

    Correlation of Hydrcon and Porosity = 0.449

    a. Test at level .05 to see whether the population correlation coefficient differs from 0.
    b. If a simple linear regression analysis had been carried out, what percentage of
       observed variation in porosity could be attributed to the model relationship?

58. The wicking properties of certain fabrics were investigated in the article “Thermal and
    Water Vapor Transport Properties of Selected Lofty Nonwoven Products” (Textile Res. J.
    2017: 1413-1424). Use the accompanying data and a .01 significance level to determine
    whether there is a significant correlation between thickness x (mm) and water vapor
    resistance y (m²Pa/W). Is the result of the test surprising in light of the value of r?

    x   20   20   30   30   40   40
    y   60   56   65   70   96   78

59. The body armor report introduced in Example 12.1 also reported studies in which two
    different methods were used to measure the same body armor deformations from bullet
    impact. The goal of these studies was to assess the extent to which two different
    measurement instruments agree. Eighty-three backface deformations (mm) were measured using
    a digital caliper (x) and a laser arm (y), resulting in a sample correlation coefficient
    of r = .878.
    a. Compute a 90% CI for the true correlation coefficient ρ.
    b. Test H_0: ρ = .8 versus H_a: ρ > .8 at level .05.
    c. In a regression analysis of y on x, what proportion of variation in laser arm
       measurements could be explained by variation in digital caliper measurements?
    d. If you decide to perform a regression analysis with digital caliper measurement as the
       response variable, what proportion of its variation is explainable by variation in
       laser arm measurement?

60. It is time-consuming and costly to have trucks stop in order to be weighed on a static
    scale. The Minnesota Department of Transportation considered using a scale that would
    weigh trucks while they were moving. Here is data for a sample of trucks that were weighed
    in motion and also on a static scale (1000s of lbs).

    Truck        1      2      3      4      5
    In-motion   26.0   29.9   39.5   25.1   31.6
    Static      27.9   29.1   38.0   27.0   30.3

    Truck        6      7      8      9      10
    In-motion   36.2   25.1   31.0   35.6   40.2
    Static      34.5   27.8   29.6   33.1   35.5

    a. Determine the sample correlation coefficient r.
    b. Test H_0: ρ = .85 versus H_a: ρ > .85 at level .05.
    c. How successful do you think the simple linear regression model would be in predicting
       static weight from in-motion weight? Explain.

61. A sample of n = 500 (x, y) pairs was collected and a test of H_0: ρ = 0 versus
    H_a: ρ ≠ 0 was carried out. The resulting P-value was computed to be .00032.
    a. What conclusion would be appropriate at level of significance .001?
    b. Does this small P-value indicate that there is a very strong relationship between
       x and y (a value of ρ that differs considerably from 0)? Explain.
    c. Now suppose a sample of n = 10,000 (x, y) pairs resulted in r = .022. Test H_0: ρ = 0
       versus H_a: ρ ≠ 0 at level .05. Is the result statistically significant? Comment on the
       practical significance of your analysis.

62. Let x be number of hours per week of studying and y be grade point average. Suppose we
    have one sample of (x, y) pairs for females and another for males. Then we might like to
    test the hypothesis H_0: ρ_1 - ρ_2 = 0 against the alternative that the two population
    correlation coefficients are different.
    a. Use properties of the transformed variable V = .5 ln[(1 + R)/(1 - R)] to propose an
       appropriate test statistic and rejection region (let R_1 and R_2 denote the two sample
       correlation coefficients).
    b. The paper “Relational Bonds and Customer’s Trust and Commitment: A Study on the
       Moderating Effects of Web Site Usage” (Serv. Ind. J. 2003: 103-124) reported that
       n_1 = 261, r_1 = .59, n_2 = 557, r_2 = .50, where the first sample consisted of
       corporate website users and the second of nonusers; here r is the correlation between
       an assessment of the strength of economic bonds and performance. Carry out the test
       for this data (as did the authors of the cited paper).

63. Verify that the t ratio for testing H_0: β_1 = 0 in Section 12.3 is identical to t for
    testing H_0: ρ = 0.


64. Verify Property 2 of the correlation coefficient: the value of r is independent of the
    units in which x and y are measured; that is, if x_i' = ax_i + c and y_i' = by_i + d,
    a > 0, b > 0, then r for the (x_i', y_i') pairs is the same as r for the (x_i, y_i) pairs.

65. Consider a time series (that is, a sequence of observations X_1, X_2, ... on some response
    variable, e.g., concentration of a pollutant, over time) with observed values
    x_1, x_2, ..., x_n over n time periods. Then the lag 1 autocorrelation coefficient, which
    assesses the strength of relationship between series values one time unit apart, is
    defined as

\[
r_1 = \frac{\sum_{i=1}^{n-1} (x_i - \bar{x})(x_{i+1} - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

    Autocorrelation coefficients r_2, r_3, ... for lags 2, 3, ... are defined analogously.
    a. Calculate the values of r_1, r_2, and r_3 for the temperature data from Chapter 1
       Exercise 95.
    b. Consider the pairs (x_1, x_2), (x_2, x_3), ..., (x_{n-1}, x_n). What is the difference
       between the formula for the sample correlation coefficient r applied to these pairs
       and the formula for r_1? What if n, the length of the series, is large? What about
       r_2 compared to r for the n - 2 pairs (x_1, x_3), (x_2, x_4), ..., (x_{n-2}, x_n)?
    c. Analogous to the population correlation coefficient ρ, let ρ_i (i = 1, 2, 3, ...)
       denote the theoretical or long-run autocorrelation coefficients at the various lags.
       If all these ρ’s are zero, there is no (linear) relationship between observations in
       the series at any lag. In this case, if n is large, each R_i has approximately a
       normal distribution with mean 0 and standard deviation 1/√n, and different R_i’s are
       almost independent. Therefore H_0: ρ_i = 0 can be rejected at a significance level of
       approximately .05 if either r_i > 2/√n or r_i < -2/√n. If n = 100 and r_1 = .16,
       r_2 = -.09, r_3 = -.15, is there evidence of theoretical autocorrelation at any of the
       first three lags?
    d. If you are testing the null hypothesis in (c) for more than one lag, why might you
       want to increase the cutoff constant 2 in the rejection region? [Hint: What about the
       probability of committing at least one type I error?]


66. Let s_x and s_y denote the sample standard deviations of the observed x’s and y’s,
    respectively.
    a. Show that S_xx = (n - 1)s_x² and similarly for the y’s.
    b. Show that an alternative expression for the estimated regression line β̂_0 + β̂_1 x is

\[
\hat{y} = \bar{y} + r \cdot \frac{s_y}{s_x}(x - \bar{x})
\]
Sy 


12.6 Investigating Model Adequacy: Residual Analysis


In the last several sections we have taken for granted that our data is consistent with the simple linear
regression model, which makes certain assumptions about the “true error” term ε. Table 12.2
summarizes these model assumptions.
Since we do not observe the true errors in practice, our assessment of the plausibility of these
assumptions is made based on the observed residuals e_1, ..., e_n. Certain graphs of the residuals, some
of which appeared in the context of ANOVA in Chapter 11, will allow us to validate the regression
assumptions. Moreover, these graphs may reveal other unusual or noteworthy features of the data.

Table 12.2 Assumptions of the simple linear regression model

Assumption          In terms of Y                                   In terms of ε
Linearity           E(Y|x) is a linear function of x.               For any fixed x, E(ε) = 0.
Normality           For any fixed x, the Y distribution             For any fixed x, the rv ε is normally
                    is normal.                                      distributed.
Constant variance   The variance of Y at any fixed x value is       V(ε) = σ², independent of x.
                    independent of x.
Independence        Y_i’s for different observations are            ε_i’s for different observations are
                    independent.                                    independent.


Residuals and Standardized Residuals 

Suppose the simple linear regression model is correct, and let ŷ_i = β̂_0 + β̂_1 x_i be the predicted y value
of the ith observation. Then the ith residual is e_i = y_i - ŷ_i = y_i - (β̂_0 + β̂_1 x_i). To derive properties of
the residuals, let Y_i - Ŷ_i represent the ith residual as a random variable (i.e., before observations are
actually made). Then

\[
E(Y_i - \hat{Y}_i) = E(Y_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 x_i) = 0
\]

It can also be shown (Exercise 74) that

\[
V(Y_i - \hat{Y}_i) = \sigma^2 \left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right] \tag{12.11}
\]

Notice that the further x_i is from x̄, the smaller the variance will be. This is because the least squares
line is “pulled toward” observations whose x values are extreme relative to the other x values.
Replacing σ² by s_e² and taking the square root of Equation (12.11) gives the estimated standard
deviation of the ith residual.

Let’s now standardize each residual by subtracting the mean value (zero) and then dividing by the
estimated standard deviation.

DEFINITION

The standardized residuals are

\[
e_i^* = \frac{e_i - 0}{s_e \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}}, \qquad i = 1, \ldots, n
\]

If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 standard
deviations larger than what would be expected from fitting the correct model. Though the standard
deviations of the residuals differ from one another, if n is reasonably large the bracketed term in
(12.11) will be approximately 1, so some sources use e_i* ≈ e_i/s_e as the standardized residual.
Computation of the e_i*’s can be tedious, but the most widely used statistical computer packages
automatically provide these values and can construct various plots involving them. 
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Example 12.18 Does stress really accelerate aging? A study described in the article “Accelerated 
Telomere Shortening in Response to Life Stress” (Proc. Nat. Acad. Sci. 2004: 17312-17315) inves- 
tigated the relationship between x = perceived stress level (on a quantitative scale) and y = telomere 
length, a biological measure of cell longevity (smaller telomere lengths indicate shorter lifespan at the 
cellular level). Figure 12.23 shows a scatterplot of (x, y) pairs for 38 subjects; the plot suggests a 
negative, weak-to-moderate (r = —.32) association between stress level and telomere length. 


Telomere length 


0.8 


T T T T T T > Perceived stress level 
5 10 15 20 25 30 35 


Figure 12.23 Scatterplot of data in Example 12.18 


The accompanying table displays the data, residuals, and standardized residuals obtained from 
software. The estimated standard deviations of the residuals are slightly different, e.g., for the first two 
observations, s,, © .156 while s,, ~ .157. 


Xj Ji ej ej Xj Ji ej ej 
14 1.30 0.068 0.439 7 1.35 0.065 0.431 
17 1.32 0.111 0.710 11 1.00 —0.255 —1.649 
14 1.08 —0.152 —0.971 15 1.24 0.016 0.103 
27 1.02 —0.112 —0.729 5 1.25 —0.050 —0.341 
22 1.24 0.070 0.446 21 1.26 0.082 0.524 
12 1.18 —0.067 —0.431 24 1.50 0.345 2.218 
22 1.18 0.010 0.062 21 1.24 0.062 0.396 
24 1.12 —0.035 —0.224 6 1.50 0.207 1.387 
25 0.94 —0.207 —1.337 20 1.30 0.114 0.729 
18 1.46 0.259 1.650 22 1.22 0.050 0.318 
28 1:22 0.096 0.627 26 0.84 —0.300 —1.941 
21 1.30 0.122 0.779 10 1.30 0.038 0.246 
19 0.84 —0.353 —2.250 18 1.12 —0.081 —0.515 
23 1.18 0.017 0.112 17 1.12 —0.089 —0.564 
15 1.22 —0.004 —0.025 20 1.22 0.034 0.220 
15 0.92 —0.304 —1.943 13 1.10 —0.139 —0.895 
27 1.12 —0.012 —0.078 33 0.94 —0.146 —0.994 
17 1.40 0.191 1.220 20 1.32 0.134 0.857 
6 1.32 0.027 0.182 29 1.30 0.183 1.209 
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Diagnostic Plots for Checking Assumptions 
The basic plots that many statisticians recommend for an assessment of model validity are the 
following: 


1. e* (or e;) on the vertical axis and x; on the horizontal axis—i.e., a plot of the (x;, e7) or (x;, e;) pairs 
2. e* (or e;) on the vertical axis and 7; on the horizontal axis—i.e., a plot of the (¥,, e7) or (5%, e;) pairs 


3. A normal probability plot of the e;’s (or e;’s) 


Plots 1 and 2 are called residual plots (against the explanatory variable and the fitted values, 
respectively). These two plots generally look quite similar, since ¥ is simply a linear function of x; the 
advantage of plot 2 is that we may also use it for assumption diagnostics in multiple regression, as 
we'll see in Section 12.7. Diagnostic plots 2 and 3 were both utilized in Chapter 11 for validating 
ANOVA assumptions. 
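
To make the plotted quantities concrete, here is a minimal Python sketch of how residuals and standardized residuals might be computed for a simple linear regression. It assumes the standard result that the estimated standard deviation of the ith residual is $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2/S_{xx}}$; the function name and the small data set are hypothetical and are not taken from Example 12.18.

```python
import math

def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) from the least squares line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least squares slope
    b0 = ybar - b1 * xbar               # least squares intercept
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    e_star = []
    for xi, ei in zip(x, e):
        # estimated standard deviation of the i-th residual (assumed formula)
        sd_i = s_e * math.sqrt(1 - 1 / n - (xi - xbar) ** 2 / sxx)
        e_star.append(ei / sd_i)
    return e, e_star

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
e, e_star = standardized_residuals(x, y)
```

Plotting `e_star` against `x` (or against the fitted values) gives plots 1 and 2. Observations far from $\bar{x}$ have residuals with smaller standard deviations, which is why the standardized residuals are not simply a constant multiple of the ordinary residuals.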

A residual plot can be used to validate two assumptions for simple linear regression: linearity and 
constant variance. Ideally, residuals should be randomly distributed about a horizontal line passing 
through 0. Figure 12.24 shows two prototype scatterplots and the corresponding residual plots. 
Figure 12.24a shows a scatterplot for which a straight line, at least initially, may appear a good fit. 
However, the associated residual plot exhibits strong curvature, suggesting the relationship between 
x and (the mean of) y is not actually linear. Figure 12.24b shows nonconstant variance: as x increases, 
so does the spread of the residuals about the mean line. (It is certainly possible to have a residual plot 
indicating both nonlinearity and nonconstant variance.)

Figure 12.24 Scatterplots and residual plots: (a) violation of the linearity assumption;
(b) violation of the constant variance assumption

The assumption of normally distributed errors is, naturally, checked by a normal probability plot of
the residuals or standardized residuals. As before, approximate normality of the residuals becomes
less important as n increases for most of our inferential procedures in regression. For example, the
t test in Section 12.3 is still valid for large n even if the residuals are clearly nonnormal. The
exception to this is a prediction interval (PI) for a future y value presented in Section 12.4.


Example 12.19 (Example 12.18 continued) Figure 12.25 presents a residual-versus-fit plot
($e_i$ vs. $\hat{y}_i$) and a normal probability plot of the $e_i^*$’s for the stress-telomere data. The lack of a pattern
in Figure 12.25a, e.g., lack of curvature, supports the linearity assumption, while the relatively equal
vertical spread throughout the graph indicates the constant variance assumption is reasonable here.
The points in Figure 12.25b fall quite close to a straight line, suggesting that the standardized residuals (and,
by extension, the true errors $\varepsilon_i$) might reasonably come from a normally distributed population.

Figure 12.25 Plots for the data from Example 12.19

What about the final assumption, independence? Provided that the observations were obtained
through random sampling (or, in the case of an experiment, treatments were randomly assigned to
subjects), it is reasonable to treat the response values as independent. In Example 12.18, the 38
subjects were volunteers, but there is no reason to think their stress levels or telomere lengths are
related. (The lack of random sampling does, however, call into question the extent to which the
study’s results can be generalized to a larger population of individuals.) ■


Other Diagnostic Tools 
Besides violations of the inference requirements for simple linear regression, bivariate data will 
sometimes present other difficulties: 


1. The selected model fits the data well except for a few discrepant or outlying data values, which 
may have greatly influenced the choice of the best-fit function. 

2. When the observations $(x_i, y_i)$ appear in time order (i.e., the subscript i is actually a time index),
the errors exhibit dependence over time. 

3. One or more relevant explanatory variables have been omitted from the model. 


Figure 12.26 presents plots corresponding to these three scenarios. Some unusual observations can 
be detected by a residual plot, particularly those with large standardized residuals (see Figure 12.26a). 
However, detection of all types of unusual observations can be difficult, especially in a multiple 
regression setting. A more complete analysis of unusual observations for both simple and multiple 
regression is presented in Section 12.9.

Figure 12.26 Plots that indicate abnormality in data: (a) a discrepant observation; (b) dependence in errors;
(c) an omitted explanatory variable

Figure 12.26b shows a plot of the standardized residuals against time order; this is only appropriate
when the data is collected sequentially (in successive time periods) rather than randomly. The
line segments connecting successive points emphasize the sequential nature of the observations.
Observe that the residuals have a perfect alternating pattern: the first residual is above the mean line,
the next one is below, the next above, and so on. This is an example of autocorrelation: a
time-dependent pattern in the residuals. The methods of this chapter should be applied with great caution to
modeling sequential observations (known as time series data), since the independence assumption is
generally not met for this type of data.
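
A numerical companion to the time-order plot is the lag-1 sample autocorrelation of the residuals. The text does not use this summary; it is offered here only as a minimal Python sketch, and the function name and the alternating residual values are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Sample correlation between each residual and the one that follows it in time."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t + 1] - mean) for t in range(n - 1))
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den

# Hypothetical standardized residuals with an alternating above/below pattern
alternating = [0.8, -0.7, 0.9, -1.1, 0.6, -0.8, 1.0, -0.9]
r1 = lag1_autocorrelation(alternating)   # strongly negative for an alternating pattern
```

A value of `r1` near 0 is what independent errors would tend to produce; a value far from 0 in either direction points to dependence over time.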

Figure 12.26c shows a plot of the $e_i^*$’s against an explanatory variable other than x. The presence
of a pattern suggests that this other explanatory variable should be added to the model, resulting in a
multiple regression model. In Example 12.18, we might find that the residuals from the regression of
y = telomere length on $x_1$ = stress level are linearly related to the values of $x_2$ = subject’s age. If so, it
makes sense to use both $x_1$ and $x_2$ to predict y (the topic of Section 12.7).


Remedies for Assumption Violations 

We now briefly indicate what remedies are available for some of the difficulties encountered in this 
section. Several of these are discussed in greater detail in subsequent sections of the book. For a more 
comprehensive discussion, one or more of the bibliographic references on regression analysis should 
be consulted. 

If the relationship between x and y appears nonlinear (e.g., as indicated in the residual plot of 
Figure 12.24a), then a model other than $Y = \beta_0 + \beta_1 x + \varepsilon$ may be fit. This can be achieved by
transformation of the x and/or y variable, or by inclusion of higher-order polynomial terms (see 
Section 12.8). 

Transformations of the y variable can also be used to remedy nonconstant variance. For instance, if
the spread of the residuals grows with x as in the residual plot of Figure 12.24b, the transformation
$y' = \ln(y)$ is often applied. Another popular approach to addressing nonconstant variance is the
method of weighted least squares. The basic idea of weighted least squares is to find coefficients $b_0$
and $b_1$ to minimize the expression

$$g_w(b_0, b_1) = \sum w_i [y_i - (b_0 + b_1 x_i)]^2$$

where the $w_i$’s are “weights” determined by the variance structure of the errors. For example, if the
standard deviation of Y is proportional to x (for x > 0), that is, $V(Y) = kx^2$, then it can be shown that
the weights $w_i = 1/x_i^2$ yield minimum-variance estimators of $\beta_0$ and $\beta_1$. The books by Kutner et al.
and by Chatterjee and Hadi explain weighted least squares in detail (see the bibliography). Weighted
least squares is used quite frequently by econometricians (economists who use statistical methods)
to estimate parameters.
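
The weighted criterion can be minimized in closed form, just as in ordinary least squares, by replacing ordinary means with weighted means. The following Python sketch illustrates this under the assumption that the weights are known; the function name and the data are hypothetical.

```python
def weighted_least_squares(x, y, w):
    """Return (b0, b1) minimizing sum of w_i * [y_i - (b0 + b1 * x_i)]^2."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw    # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw    # weighted mean of y
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = yw - b1 * xw
    return b0, b1

# Hypothetical data whose spread grows with x; weights 1/x^2 as described above
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.9, 5.2, 8.7, 18.1, 31.0]
w = [1 / xi ** 2 for xi in x]
b0, b1 = weighted_least_squares(x, y, w)
```

With all weights equal, the function reduces to the ordinary least squares line. With weights $1/x_i^2$, observations at small x (where the variance is smallest) influence the fit most.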

Generally speaking, violations of the normality assumption cannot be fixed, though such a 
problem may naturally be resolved while addressing linearity and constant variance issues. Again, if 
the sample size is reasonably large then normality is not as important, except for prediction intervals. 
If a small data set has the feature that only the normality assumption is violated, consult your friendly 
neighborhood statistician for information on computer-intensive methods (e.g., bootstrapping). 

When plots or other evidence suggest that the data set contains outliers or points having large 
influence on the resulting fit, one possible approach is to omit these outlying points and re-compute 
the estimated regression equation. This would certainly be correct if it was found that the outliers 
resulted from errors in recording data values or experimental errors. If no assignable cause can be 
found for the outliers, it is still desirable to report the estimated equation both with and without 
outliers. Another approach is to retain possible outliers but to use an estimation principle that puts 
relatively less weight on outlying values than does the principle of least squares. One such principle is 
minimize absolute deviations (MAD), which selects $b_0$ and $b_1$ to minimize $\sum |y_i - (b_0 + b_1 x_i)|$.
Unlike the least squares estimates, there are no nice formulas for the MAD estimates; their values
must be found by using an iterative computational procedure. Such procedures are also used when it
is suspected that the $\varepsilon_i$’s have a distribution that is not normal but instead has “heavy tails” (making it
much more likely than for the normal distribution that discrepant values will enter the sample); robust
regression procedures are those that produce reliable estimates for a wide variety of underlying error
distributions. Least squares estimators are not robust, in the same way that the sample mean $\bar{X}$ is not a
robust estimator for $\mu$.
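
For small data sets, the MAD line can be found without an iterative routine. The Python sketch below is not the iterative procedure mentioned above; it is a brute-force search that relies on the known property that some minimizing line passes through at least two of the data points. The function names and the data (including the deliberate outlier) are hypothetical.

```python
from itertools import combinations

def sum_abs_dev(b0, b1, x, y):
    """MAD criterion: sum of absolute deviations from the line b0 + b1*x."""
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

def mad_line(x, y):
    """Search all lines through pairs of data points; return the best (b0, b1)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a candidate regression function
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        crit = sum_abs_dev(b0, b1, x, y)
        if best is None or crit < best[0]:
            best = (crit, b0, b1)
    return best[1], best[2]

# Hypothetical data: roughly y = 2x, except for an outlier at x = 6
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.0, 10.1, 30.0]
b0, b1 = mad_line(x, y)
```

Here the MAD slope stays close to 2 despite the outlier, whereas the least squares slope would be pulled noticeably upward. The search examines every pair of points, so it is practical only for small n.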


Exercises: Section 12.6 (67-76) 


67. The x values and standardized residuals for the temperature-energy use data of Exercise 49
(Section 12.4) are displayed in the accompanying table. Construct a standardized
residual plot and comment on its appearance.

48 53 56 58 58 
0.110 1.153 1.077 0.486 0.836 
59 59 60 68 69 
0.560 1.258 0.148 0.660 1.869 
69 70 B 715 719 
0.356 0.858 0.743 0.724 -0.622 
80 80 84 87 88 
-0.460 1.471 -2.133 1.181 -0.302


68. Suppose the variables x = commuting distance and y = commuting time are related
according to the simple linear regression model with $\sigma = 10$.

a. If n = 5 observations are made at the x values $x_1 = 5$, $x_2 = 10$, $x_3 = 15$,
$x_4 = 20$, and $x_5 = 25$, calculate the (true) standard deviations of the five
corresponding residuals.

b. Repeat part (a) for $x_1 = 5$, $x_2 = 10$, $x_3 = 15$, $x_4 = 20$, and $x_5 = 50$.

c. What do the results of parts (a) and (b) imply about the deviation of the
estimated line from the observation made at the largest sampled x value?


69. Nickel-based alloys are especially difficult to machine due to characteristics including
high hardness and low thermal conductivity. The article “Multi-response Optimization
Using ANOVA and Desirability Function Analysis: A Case Study in End
Milling of Inconel Alloy” (ARPN J. Engr. Appl. Sci. 2014: 457-463) reports the following
data on x = cutting velocity (m/min) and y = material removal rate
(mm³/min) from one experiment.

x   25      25      25      50      50      50      75      75      75
y   258.48  268.80  270.18  338.58  343.86  354.24  414.36  424.80  451.80

a. The LSRL for this data is $\hat{y} = 182.7 + 3.29x$. Calculate and plot
the residuals against x and then comment on the appropriateness of the
simple linear regression model.

b. Use $s_e = 11.759$ to calculate the standardized residuals from a simple linear
regression. Construct a standardized residual plot and comment. Also construct
a normal probability plot and comment.


70. As the air temperature drops, river water becomes supercooled and ice crystals form.
Such ice can significantly affect the hydraulics of a river. The article “Laboratory
Study of Anchor Ice Growth” (J. Cold Regions Engr. 2001: 60-66) described an
experiment in which ice thickness (mm) was studied as a function of elapsed
time (hr) under specified conditions. The following data was read from a graph in the
article: n = 33; x = .17, .33, .50, .67, ..., 5.50; y = .50, 1.25, 1.50, 2.75, 3.50, 4.75,
5.75, 5.60, 7.00, 8.00, 8.25, 9.50, 10.50, 11.00, 10.75, 12.50, 12.25, 13.25, 15.50,
15.00, 15.25, 16.25, 17.25, 18.00, 18.25, 18.15, 20.25, 19.50, 20.00, 20.50, 20.60,
20.50, 19.80.

a. The $R^2$ value resulting from a least squares fit is .977. Given the high $R^2$,
does it seem appropriate to assume an approximate linear relationship?

b. The residuals, listed in the same order as the x values, are

-1.03  -0.92  -1.35  -0.78  -0.68  -0.11   0.21
-0.59   0.13   0.45   0.06   0.62   0.94   0.80
-0.14   0.93   0.04   0.36   1.92   0.78   0.35
 0.67   1.02   1.09   0.66  -0.09   1.33  -0.10
-0.24  -0.43  -1.01  -1.75  -3.14

Plot the residuals against x, and reconsider the question in (a). What does the plot
suggest?


71. The accompanying data on x = true density (kg/mm³) and y = moisture content (% d.b.)
was read from a plot in the article “Physical Properties of Cumin Seed”
(J. Agric. Engr. Res. 1996: 93-98).

x   7.0    9.3    13.2   16.3   19.1   22.0
y   1046   1065   1094   1117   1130   1135

The equation of the least squares line is $\hat{y} = 1008.14 + 6.19268x$ (this differs very
slightly from the equation given in the article); $s_e = 7.265$ and $R^2 = .968$.

a. Carry out a test of model utility and comment.

b. Compute the values of the residuals and plot the residuals against x. Does
the plot suggest that a linear regression function is inappropriate?

c. Compute the values of the standardized residuals and plot them against x. Are
there any unusually large (positive or negative) standardized residuals? Does
this plot give the same message as the plot of part (b) regarding the
appropriateness of a linear regression function?

72. Continuous recording of heart rate can be used to obtain information about the level
of exercise intensity or physical strain during sports participation, work, or other
daily activities. The article “The Relationship Between Heart Rate and Oxygen
Uptake During Non-Steady State Exercise” (Ergonomics 2000: 1578-1592) reported
on a study to investigate using heart rate response (x, as a percentage of the maximum
rate) to predict oxygen uptake (y, as a percentage of maximum uptake) during
exercise. The accompanying data was read from a graph in the paper.

HR    43.5  44.0  44.0  44.5  44.0  45.0  48.0  49.0
VO2   22.0  21.0  22.0  21.5  25.5  28.0

HR    49.5  51.0  54.5  57.5  57.7  61.0  72.0
VO2   32.0  29.0  38.5  30.5  57.0  40.0  58.0  72.0

Use a statistical software package to perform a simple linear regression analysis.
Considering the list of potential difficulties in this section, see which of them apply to
this data set.


73. Example 12.6 presented the residuals from a simple linear regression of moisture
content y on filtration rate x.

a. Plot the residuals against x. Does the resulting plot suggest that a straight-line
regression function is a reasonable choice of model? Explain your reasoning.

b. Using $s_e = .665$, compute the values of the standardized residuals. Is $e_i^* \approx e_i/s_e$
for $i = 1, \dots, n$, or are the $e_i^*$’s not close to being proportional to the $e_i$’s?

c. Plot the standardized residuals against x. Does the plot differ significantly in
general appearance from the plot of part (a)?

74. Express the ith residual $Y_i - \hat{Y}_i$ (where $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 x_i$) in the form $\sum c_j Y_j$, a linear
function of the $Y_j$’s. Then use rules of variance to verify that $V(Y_i - \hat{Y}_i)$ is given
by Expression (12.11).

75. Consider the following classic four (x, y) data sets; the first three have the same
x values, so these values are listed only once (Frank Anscombe, “Graphs in
Statistical Analysis,” Amer. Statist. 1973: 17-21):

Data set   1-3     1       2       3       4      4
Variable   x       y       y       y       x      y
           10.0    8.04    9.14    7.46    8.0    6.58
            8.0    6.95    8.14    6.77    8.0    5.76
           13.0    7.58    8.74   12.74    8.0    7.71
            9.0    8.81    8.77    7.11    8.0    8.84
           11.0    8.33    9.26    7.81    8.0    8.47
           14.0    9.96    8.10    8.84    8.0    7.04
            6.0    7.24    6.13    6.08    8.0    5.25
            4.0    4.26    3.10    5.39   19.0   12.50
           12.0   10.84    9.13    8.15    8.0    5.56
            7.0    4.82    7.26    6.42    8.0    7.91
            5.0    5.68    4.74    5.73    8.0    6.89

For each of these four data sets, the values of $\bar{x}$, $\bar{y}$, $S_{xx}$, $S_{yy}$, and $S_{xy}$ are virtually
identical, so all quantities computed from these five will be essentially identical for
the four sets: the equation of the least squares line ($\hat{y} = 3 + .5x$), SSE, $s_e$, $R^2$,
t intervals, t statistics, and so on. The summaries provide no way of distinguishing
among the four data sets. Based on a scatterplot and a residual plot for each set,
comment on the appropriateness or inappropriateness of fitting a straight-line
model; include in your comments any specific suggestions for how a “straight-line
analysis” might be modified or qualified.


76. If there is at least one x value at which more than one observation has been made, the
lack of fit test is a formal procedure for testing

$H_0$: $\mu_Y = \beta_0 + \beta_1 x$ for some values $\beta_0$, $\beta_1$ (the true regression function is linear)

versus

$H_a$: $H_0$ is not true (the true regression function is not linear)

Suppose observations are made at c levels $x_1, x_2, \dots, x_c$. Let $Y_{i1}, Y_{i2}, \dots, Y_{in_i}$ denote the
$n_i$ observations when $x = x_i$ ($i = 1, \dots, c$). With $n = \sum n_i$, SSE has $n - 2$ df. We
break SSE into two pieces, SSPE (pure error) and SSLF (lack of fit), as follows:

$$\text{SSPE} = \sum_i \sum_j (Y_{ij} - \bar{Y}_{i\cdot})^2$$

$$\text{SSLF} = \text{SSE} - \text{SSPE}$$

The $n_i$ observations at $x_i$ contribute $n_i - 1$ df to SSPE, so the number of degrees of
freedom for SSPE is $\sum_i (n_i - 1) = n - c$, and the degrees of freedom for SSLF is
$n - 2 - (n - c) = c - 2$. Let MSPE = SSPE/(n − c) and MSLF = SSLF/(c − 2).
Then it can be shown that whereas $E(\text{MSPE}) = \sigma^2$ whether or not $H_0$ is true,
$E(\text{MSLF}) = \sigma^2$ if $H_0$ is true and $E(\text{MSLF}) > \sigma^2$ if $H_0$ is false.

Test statistic: $F = \text{MSLF}/\text{MSPE}$
Rejection region: $f > F_{\alpha,\, c-2,\, n-c}$
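
The decomposition just described can be sketched in a few lines of Python. This is only an illustration under the definitions above; the function name, the dictionary keys, and the data are hypothetical, and the data are not those of the exercise.

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Split SSE from a straight-line fit into pure error and lack of fit."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    groups = defaultdict(list)          # y values grouped by distinct x level
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    sspe = 0.0                          # variation within each x level
    for ys in groups.values():
        level_mean = sum(ys) / len(ys)
        sspe += sum((yi - level_mean) ** 2 for yi in ys)
    sslf = sse - sspe

    f = (sslf / (c - 2)) / (sspe / (n - c))
    return {"SSE": sse, "SSPE": sspe, "SSLF": sslf, "F": f, "df": (c - 2, n - c)}

# Hypothetical data: two observations at each of c = 4 levels, with clear curvature
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.1, 0.9, 4.2, 3.8, 8.9, 9.1, 16.3, 15.7]
result = lack_of_fit(x, y)
```

For these hypothetical data the level means are 1, 4, 9, and 16, so SSPE = 0.30, SSLF = 8.00, and f is about 53.3 on 2 and 4 df. This far exceeds $F_{.05,2,4} = 6.94$, so linearity would be rejected at level .05.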

The following data comes from the article 
“Single Cell Isolation Process with Laser 
Induced Forward Transfer” (J. Bio. Engr. 
2017), with x = laser pulse energy (J), 
w=titanium thickness (nm) and 
y = number of viable cells resulting from 
a new cell-isolation technique. 


a. Construct a scatterplot of y vs x. Does 
it appear that x and the mean of y are 
linearly related? 


b. Carry out the lack of fit test on the 
(x, y) data at significance level .05. 


c. Repeat parts (a) and (b) for y vs w.

In multiple regression, the objective is to build a probabilistic model that relates a response variable
y to more than one explanatory or predictor variable. Let k represent the number of predictor variables
($k \ge 2$) and denote these predictors by $x_1, x_2, \dots, x_k$. For example, in attempting to predict the selling
price of a house, we might have k = 4 with $x_1$ = size (ft²), $x_2$ = age (years), $x_3$ = number of
bedrooms, and $x_4$ = number of bathrooms.


THE MULTIPLE LINEAR REGRESSION MODEL

There are parameters $\beta_0, \beta_1, \dots, \beta_k$ and $\sigma$ such that for any fixed values
of the explanatory variables $x_1, \dots, x_k$, the response variable Y is related
to the $x_j$’s through the model equation

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{12.12}$$

where the random variable $\varepsilon$ is assumed to follow a $N(0, \sigma)$ distribution.
It is also assumed that the $\varepsilon_i$’s (and thus the $Y_i$’s) associated with
different observations are independent of one another.


As before, $\varepsilon$ is the random error term (or random deviation) in the model, and the assumptions for
statistical inference may be stated in terms of $\varepsilon$. Equation (12.12) says that the true (or population)
regression function, $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, gives the expected value of Y as a function of $x_1, \dots, x_k$.
The $\beta_j$’s are the true (or population) regression coefficients.

Interpret the regression coefficients carefully! Performing multiple regression on k explanatory
variables is not the same thing as creating k separate, simple linear regression models. For example,
$\beta_1$ (the coefficient on the predictor $x_1$) cannot be interpreted in multiple regression without reference
to the other predictor variables in the model. Here’s a correct interpretation:

$\beta_1$ = the change in the average value of y associated with a one-unit increase in $x_1$,
adjusting for the effects of the other explanatory variables

As an example, for the four predictors of home price mentioned above, $\beta_1$ is interpreted as the change
in expected selling price when size increases by 1 ft², adjusting for the effects of age, number of
bedrooms, and number of bathrooms. The other coefficients are interpreted similarly.

Some statisticians refer to $\beta_1$ as describing the effect of $x_1$ on y “after removing the effects of” the
other predictors or “in the presence of” the other predictors; either of these interpretations is
acceptable. You may hear $\beta_1$ defined as the change in the mean of y associated with a one-unit
increase in $x_1$ “while holding the other variables fixed.” This is correct only if it is possible to increase
the value of one predictor while the values of all others remain constant.


Estimating Parameters 

The data in simple linear regression consists of n pairs $(x_1, y_1), \dots, (x_n, y_n)$. Suppose that a multiple
regression model contains two explanatory variables, $x_1$ and $x_2$. Then each observation will consist of
three numbers (a triple): a value of $x_1$, a value of $x_2$, and a value of y. More generally, with k predictor
variables, each observation will consist of k + 1 numbers (a “k + 1 tuple”). The values of the
predictors in the individual observations are denoted using double-subscripting:

$x_{ij}$ = the value of the jth predictor $x_j$ in the ith observation
($i = 1, \dots, n$; $j = 1, \dots, k$)

Thus the first subscript is the observation number and the second subscript is the predictor number.
For example, $x_{83}$ is the value of the third predictor in the eighth observation (to avoid confusion, a
comma can be inserted between the two subscripts, e.g. $x_{12,3}$). The first observation in our data set is
then $(x_{11}, x_{12}, \dots, x_{1k}, y_1)$, the second is $(x_{21}, x_{22}, \dots, x_{2k}, y_2)$, and so on.


Consider candidates $b_0, b_1, \dots, b_k$ for estimates of the $\beta_j$’s and the corresponding candidate
regression function $b_0 + b_1 x_1 + \cdots + b_k x_k$. Substituting the predictor values for any individual
observation into this candidate function gives a prediction for the y value that would be observed, and
subtracting this prediction from the actual observed y value gives the prediction error. As in
Section 12.2, the principle of least squares says we should square these prediction errors, sum, and then
take as the least squares estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ the values of the $b_j$’s that minimize the sum of
squared prediction errors. To carry out this procedure, define the criterion function (sum of squared
prediction errors) by

$$g(b_0, b_1, \dots, b_k) = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_{i1} + \cdots + b_k x_{ik})]^2,$$

then take the partial derivative of $g(\cdot)$ with respect to each $b_j$ ($j = 0, 1, \dots, k$) and equate these k + 1
partial derivatives to 0. The result is a system of k + 1 equations, the normal equations, in the k + 1
unknowns (the $b_j$’s):


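$$
\begin{aligned}
n b_0 + \Big(\sum x_{i1}\Big) b_1 + \Big(\sum x_{i2}\Big) b_2 + \cdots + \Big(\sum x_{ik}\Big) b_k &= \sum y_i \\
\Big(\sum x_{i1}\Big) b_0 + \Big(\sum x_{i1}^2\Big) b_1 + \Big(\sum x_{i1} x_{i2}\Big) b_2 + \cdots + \Big(\sum x_{i1} x_{ik}\Big) b_k &= \sum x_{i1} y_i \\
&\;\;\vdots \\
\Big(\sum x_{ik}\Big) b_0 + \Big(\sum x_{i1} x_{ik}\Big) b_1 + \cdots + \Big(\sum x_{i,k-1} x_{ik}\Big) b_{k-1} + \Big(\sum x_{ik}^2\Big) b_k &= \sum x_{ik} y_i
\end{aligned}
\tag{12.13}
$$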


Notice that the normal equations are linear in the unknowns (because the criterion function is
quadratic). We will assume that the system has a unique solution, the least squares estimates
$\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$. The result is an estimated regression function

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$$

Section 12.9 uses matrix algebra to deal with the system of equations and develop inferential
procedures for multiple regression. For the moment, though, we shall take advantage of the fact that all of
the commonly used statistical software packages are programmed to solve the equations and provide
the results needed for inference.
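
To show what such software does behind the scenes, the following Python sketch builds the k + 1 normal equations from the data and solves them by Gaussian elimination. It is a bare-bones illustration, not the numerically refined method that statistical packages actually use; the function name and the data are hypothetical.

```python
def solve_normal_equations(X, y):
    """X: list of rows [x_i1, ..., x_ik]. Returns [b0, b1, ..., bk]."""
    rows = [[1.0] + list(r) for r in X]          # leading 1 carries the intercept
    p = len(rows[0])                             # p = k + 1 unknowns
    # Coefficient matrix and right-hand side of the normal equations
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    c = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    M = [A[a] + [c[a]] for a in range(p)]        # augmented matrix
    # Forward elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= factor * M[col][j]
    # Back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (M[r][p] - sum(M[r][j] * b[j] for j in range(r + 1, p))) / M[r][r]
    return b

# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2 (no error term)
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [-3, 2, -8, 0, -13]
b = solve_normal_equations(X, y)   # approximately [1, 2, -3]
```

Because these hypothetical y values lie exactly on a plane, the solution reproduces the generating coefficients. With real data the same steps yield the least squares estimates, provided the system has a unique solution.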

Sometimes interest in the individual regression coefficients is the main reason for a regression
analysis. The article “Autoregressive Modeling of Baseball Performance and Salary Data” (Proc. of
the Statistical Graphics Section, Amer. Stat. Assoc. 1988, 132-137) describes a multiple regression of
runs scored as a function of singles, doubles, triples, home runs, and walks (combined with
hit-by-pitcher). The estimated regression equation is

runs = -2.49 + .47 singles + .76 doubles + 1.14 triples + 1.54 home runs + .39 walks

This is very similar to the popular slugging percentage statistic, which gives weight 1 to singles, 2 to
doubles, 3 to triples, and 4 to home runs. However, the slugging percentage gives no weight to walks,
whereas the estimated regression function puts weight .39 on walks, more than 80% of the weight it
assigns to singles. The importance of walks is well known among statisticians who follow baseball,
and it is interesting that there are now some statistically savvy people in major league baseball
management who are emphasizing walks in choosing players.


Example 12.20 Fuel efficiency of an automobile is determined to a large extent by various intrinsic
characteristics of the vehicle. Consider the following multivariate data set consisting of n = 38
observations on $x_1$ = weight (1000s of pounds), $x_2$ = engine displacement (i.e., engine size, in³),
$x_3$ = number of cylinders, and y = fuel efficiency, measured in gallons per 100 miles:


x1 x2 x3 y x1 x2 x3 y
4.360 350 8 5.92 3.830 318 8 5.49 
4.054 351 8 6.45 2.585 140 4 3.77 
3.605 267 8 5.21 2.910 171 6 4.57 
3.940 360 8 5.41 1.975 86 4 2.93 
2.155 98 4 3.33 1.915 98 4 2.85 
2.560 134 4 3.64 2.670 121 4 3.65 
2.300 119 4 3.68 1.990 89 4 3.17 
2.230 105 4 3.24 2.135 98 4 3.39 
2.830 131 5 4.93 2.670 151 4 3.52 
3.140 163 6 5.88 2.595 173 6 3.47 
2.795 121 4 4.63 2.700 173 6 3.13
3.410 163 6 6.17 2.556 151 4 2.99 
3.380 231 6 4.85 2.200 105 4 2.92 
3.070 200 6 4.81 2.020 85 4 3.14 
3.620 225 6 5.38 2.130 91 4 2.68 
3.410 258 6 5.52 2.190 97 4 3.28 
3.840 305 8 5.88 2.815 146 6 4.55 
3.725 302 8 5.68 2.600 121 4 4.65 
3.955 351 8 6.06 1.925 89 4 3.13 


We’ve chosen to use this representation of fuel efficiency (similar to European measurement), 
rather than the traditional American “miles per gallon” version, because the former is linearly related 
to our predictors while the latter is not. One consequence is that lower y values are better, in the sense 
that they indicate vehicles with better fuel efficiency. 

Our goal is to predict fuel efficiency (y) from the predictor variables x,, x2, and x3 (so k = 3). 
Figure 12.27 shows R output from a request to fit a linear function to the fuel efficiency data. The 
least squares coefficients appear in the estimate column of the coefficients block: 


$\hat\beta_0 = -1.64351$   $\hat\beta_1 = 2.33584$   $\hat\beta_2 = -0.01065$   $\hat\beta_3 = 0.21774$

Thus the estimated regression equation is

$$\hat{y} = -1.64351 + 2.33584x_1 - 0.01065x_2 + 0.21774x_3$$

    Call:
    lm(formula = Fuel.Eff ~ Weight + Disp + cyl, data = df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.63515 -0.30801  0.00029  0.23637  0.63957

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.64351    0.48512  -3.388 0.001795 **
    Weight       2.33584    0.28810   8.108 1.87e-09 ***
    Disp        -0.01065    0.00269  -3.959 0.000364 ***
    cyl          0.21774    0.11566   1.883 0.068342 .

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3749 on 34 degrees of freedom
    Multiple R-squared: 0.9034, Adjusted R-squared: 0.8948
    F-statistic: 105.9 on 3 and 34 DF, p-value: < 2.2e-16

Figure 12.27 Multiple regression output for Example 12.20

Consider an automobile that weighs 3000 lbs, has a displacement of 175 in³, and is equipped with a
six-cylinder engine. A prediction for the resulting fuel efficiency is obtained by substituting the values
of the predictors into the fitted equation:

$$\hat{y} = -1.64351 + 2.33584(3) - 0.01065(175) + 0.21774(6) = 4.8067$$


Such an automobile is predicted to use 4.8 gallons over 100 miles. This is also a point estimate of the
mean fuel efficiency of all vehicles with these specifications ($x_1 = 3$, $x_2 = 175$, $x_3 = 6$).
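
The substitution can be checked with a few lines of Python. The coefficients are those reported in Figure 12.27; the dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
# Least squares coefficients from the R output for Example 12.20
coef = {"intercept": -1.64351, "weight": 2.33584, "disp": -0.01065, "cyl": 0.21774}

def predict_fuel_use(weight_1000lb, displacement_in3, cylinders):
    """Predicted fuel use (gallons per 100 miles) from the fitted equation."""
    return (coef["intercept"]
            + coef["weight"] * weight_1000lb
            + coef["disp"] * displacement_in3
            + coef["cyl"] * cylinders)

y_hat = predict_fuel_use(3.0, 175, 6)   # about 4.8067
```

Note that weight must be entered in thousands of pounds (3.0, not 3000), matching the units used when the model was fit.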


The intercept $\hat\beta_0 = -1.64351$ has no contextual meaning here, since you can’t really have a
vehicle with no weight and no engine. The coefficient $\hat\beta_1 = 2.33584$ on $x_1$ indicates that, after
adjusting for both engine displacement ($x_2$) and number of cylinders ($x_3$), a 1000-lb increase in weight
is associated with an estimated increase of about 2.34 gallons per 100 miles, on average. Equivalently,
a 100-lb weight increase is predicted to increase average fuel consumption by 0.234 gallons
per 100 miles, accounting for engine size (i.e., displacement and number of cylinders).

Consider fitting the simple linear model to just y and x3. The resulting LSRL is 


$$\hat{y} = 1.057 + 0.6067x_3,$$


suggesting that a one-cylinder increase is associated with about a 0.6 gallon increase in average fuel 
consumption per 100 miles. This is our estimate of the relationship between x3 and y ignoring the 
effects of any other explanatory variables; notice that it differs substantially from the previous 
estimate of 0.21774, which adjusted for both the vehicle’s weight and its engine displacement. (Since 
these cars have 4, 6, or 8 cylinders, it’s perhaps more appropriate to double these coefficients as an 
indication of the effect of a two-cylinder increase.) ■


Descriptive Measures of Fit 
The predicted (or fitted) value $\hat{y}_i$ results from substituting the values of the predictors from the ith
observation into the estimated equation:

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$$

The corresponding residual is $e_i = y_i - \hat{y}_i$. As in simple linear regression, the closer the residuals are to
zero, the better the job our estimated regression function is doing in predicting the y values in our
sample. For the fuel efficiency data, the values of the three predictors in the first observation are
$x_{11} = 4.360$, $x_{12} = 350$, and $x_{13} = 8$, so

$$\hat{y}_1 = -1.64351 + 2.33584(4.360) - 0.01065(350) + 0.21774(8) = 6.555$$
$$e_1 = y_1 - \hat{y}_1 = 5.92 - 6.555 = -0.635$$


Residuals are sometimes important not just for judging the quality of a regression. Several enter- 
prising students developed a multiple regression model using age, size in square feet, etc. to predict 
the price of four-unit apartment buildings. They found that one building had a large negative residual, 
meaning that the price was much lower than predicted. As it turned out, the reason was that the owner 
had “cash-flow” problems and needed to sell quickly. 

In simple linear regression, after fitting a straight line to bivariate data and obtaining the residuals, 
we calculated sums of squares and used them to obtain two assessments of how well the line 
summarized the relationship: the residual standard deviation and the coefficient of determination. 
Let’s now follow the same path in multiple regression: 


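$$\text{SSE} = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 \qquad \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 \qquad \text{SST} = \sum (y_i - \bar{y})^2$$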


These are the same expressions introduced in Section 12.2, and again it can be shown that SST = 
SSE + SSR. The interpretation is that the total variation in the values of the response variable is the 
sum of explained variation (SSR) and unexplained variation (SSE). 

Each sum of squares has an associated number of degrees of freedom (df). In particular, 


df for SSE = n — (k+1) 


This is because the k + 1 coefficients $\beta_0, \beta_1, \dots, \beta_k$ must be estimated before SSE can be obtained,
entailing a reduction of k + 1 df for SSE. Notice that for the case of simple linear regression, k = 1 
and df for SSE =n — (1 + 1) =n — 2 as before. 


DEFINITION

The standard deviation about the estimated multiple regression function
(or simply the residual standard deviation) is

$$s_e = \sqrt{\frac{\text{SSE}}{n - (k+1)}}$$

The coefficient of (multiple) determination, denoted by $R^2$, is given by

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}$$


Roughly speaking, $s_e$ is the size of a typical deviation from the fitted equation. The residual standard
deviation is also our point estimate of the model parameter $\sigma$, i.e., $\hat\sigma = s_e$. $R^2$ is the proportion of
variation in observed y values that can be explained by (i.e., attributed to) the multiple regression
model. Software often reports $100R^2$, the percent of explained variation. The closer $R^2$ is to 1, the
larger the proportion of observed y variation that is being explained. (In fact, the positive square root
of $R^2$, called the multiple correlation coefficient, turns out to be the sample correlation coefficient
between the observed $y_i$’s and the predicted $\hat{y}_i$’s, another measure of the quality of the fit of the
estimated regression function.)

Unfortunately, there is a potential problem with $R^2$ in multiple regression: its value can be inflated
by including predictors in the model that are relatively unimportant or even frivolous. For example,
suppose we plan to obtain a sample of 20 recently sold houses in order to relate sale price to various
characteristics of a house. Natural predictors include interior size, lot size, age, number of bedrooms,
and distance to the nearest school. Suppose we also include in the model the diameter of the doorknob
on the door of the master bedroom, the height of the toilet bowl in the master bath, and so on until we
have 19 predictors. Then unless we are extremely unlucky in our choice of predictors, the value of $R^2$
will be 1 (because 20 coefficients are perfectly estimated from 20 observations)! Rather than seeking a
model that has the highest possible $R^2$ value, which can be achieved just by “packing” our model with
predictors, what is desired is a relatively simple model based on just a few important predictors whose
$R^2$ value is high.

It is therefore desirable to adjust $R^2$ to take account of the fact that its value may be quite high just
because many predictors were used relative to the amount of data. The adjusted coefficient of
multiple determination or adjusted $R^2$ is defined by

$$R_a^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{MST}} = 1 - \frac{\mathrm{SSE}/[n-(k+1)]}{\mathrm{SST}/(n-1)} = 1 - \frac{n-1}{n-(k+1)} \cdot \frac{\mathrm{SSE}}{\mathrm{SST}}$$

The ratio multiplying SSE/SST in adjusted $R^2$ exceeds 1 (the denominator is smaller than the
numerator), so adjusted $R^2$ is smaller than $R^2$ itself, and in fact will be much smaller when k is large
relative to n. A value of $R_a^2$ much smaller than $R^2$ is a warning flag that the chosen model has too
many predictors relative to the amount of data.
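
The inflation effect can be seen numerically. The following is a minimal sketch, assuming NumPy is available; the data are synthetic (one genuinely useful predictor plus 18 columns of pure noise, all hypothetical), and the helper name `fit_summary` is ours rather than part of any statistics package. It reports the coefficient of determination and its adjusted version as noise predictors are added to a sample of n = 20.

```python
import numpy as np

def fit_summary(X, y):
    """Least squares fit of y on the columns of X (an intercept is added here).

    Returns (SSE, SST, R^2, adjusted R^2); the adjusted value is None
    when no error degrees of freedom remain, i.e. n - (k + 1) = 0.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])        # n x (k+1) design matrix
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / sst
    df_error = n - (k + 1)
    adj_r2 = 1 - (n - 1) / df_error * sse / sst if df_error > 0 else None
    return sse, sst, r2, adj_r2

rng = np.random.default_rng(0)
n = 20
x_real = rng.normal(size=n)                          # the one useful predictor (synthetic)
y = 3 + 2 * x_real + rng.normal(size=n)
noise = rng.normal(size=(n, 18))                     # 18 frivolous predictors
X_all = np.column_stack([x_real, noise])             # 19 predictors in total

# Fit nested models using the first k columns only.
results = {k: fit_summary(X_all[:, :k], y) for k in (1, 5, 10, 15, 18, 19)}
for k, (sse, sst, r2, adj) in results.items():
    adj_text = "undefined" if adj is None else f"{adj:.4f}"
    print(f"k = {k:2d}   R^2 = {r2:.4f}   adjusted R^2 = {adj_text}")
```

With 19 predictors and 20 observations the fit is exact, so the unadjusted value reaches 1 while the adjusted value is undefined because no error df remain.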


Example 12.21 (Example 12.20 continued) Figure 12.27 shows that $s_e \approx 0.3749$, $R^2 = 90.34\%$,
and $R_a^2 = 89.48\%$ for the fuel efficiency data. Fuel efficiency predictions based on the estimated
regression equation are typically off by about 0.375 gal/100 mi (positive or negative) from vehicles’
actual fuel efficiency. The model explains about 90% of the observed variation in fuel efficiency.


A Model Utility Test 

In multiple regression, is there a single indicator that can be used to judge whether a particular model
(equivalently, a particular set of predictors $x_1, \ldots, x_k$) will be useful? The value of $R^2$ certainly
communicates a preliminary message, but this value is sometimes deceptive because it can be greatly
inflated by using a large number of predictors (large k) relative to the sample size n (this is the
rationale behind adjusting $R^2$).

The model utility test in simple linear regression involved the null hypothesis $H_0$: $\beta_1 = 0$,
according to which there is no useful relation between y and the single predictor x. Here we consider
the assertion that $\beta_1 = 0, \beta_2 = 0, \ldots, \beta_k = 0$, which says that there is no useful relationship between
y and any of the k predictors. If at least one of these $\beta_j$'s is not 0, the corresponding predictor(s) is
(are) useful. The test is based on the F statistic derived from the regression ANOVA table (see
Sections 10.5 and 11.1 for more about F tests). A prototype multiple regression ANOVA table
appears in Table 12.3.

Table 12.3 ANOVA table for multiple regression

Source of variation   df            Sum of squares   Mean square                f
Regression            k             SSR              MSR = SSR/k                MSR/MSE
Error                 n - (k + 1)   SSE              MSE = SSE/[n - (k + 1)]
Total                 n - 1         SST


MODEL UTILITY TEST IN MULTIPLE REGRESSION

Null hypothesis: $H_0$: $\beta_1 = 0, \beta_2 = 0, \ldots, \beta_k = 0$
Alternative hypothesis: $H_a$: at least one $\beta_j \ne 0$

Test statistic value:

$$f = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/[n-(k+1)]} = \frac{\mathrm{MSR}}{\mathrm{MSE}} \qquad (12.14)$$

When $H_0$ is true, the test statistic has an F distribution with k numerator
df and n - (k + 1) denominator df.

Rejection region for a level $\alpha$ test: $f \ge F_{\alpha, k, n-(k+1)}$
P-value: area to the right of f under the $F_{k, n-(k+1)}$ curve


To understand why the test statistic value (12.14) should be compared to this particular F distribution,
divide the fundamental ANOVA identity by $\sigma^2$:

$$\frac{\mathrm{SST}}{\sigma^2} = \frac{\mathrm{SSE}}{\sigma^2} + \frac{\mathrm{SSR}}{\sigma^2}$$

When $H_0$ is true, the observations $Y_1, \ldots, Y_n$ all have the same mean $\mu = \beta_0$ and common variance $\sigma^2$. It
follows from a proposition in Section 6.4 that $\mathrm{SST}/\sigma^2 \sim \chi^2_{n-1}$. It can also be shown that
(1) $\mathrm{SSE}/\sigma^2 \sim \chi^2_{n-(k+1)}$ and (2) SSE and SSR are independent. Together, (1) and (2) imply (again, see
Section 6.4) that $\mathrm{SSR}/\sigma^2$ has a chi-squared distribution, with df = (n - 1) - (n - (k + 1)) = k. Finally,
by definition the F distribution is the ratio of two independent chi-squared rvs divided by their respective
dfs. Applying this to $\mathrm{SSR}/\sigma^2$ and $\mathrm{SSE}/\sigma^2$ leads to the F ratio

$$\frac{(\mathrm{SSR}/\sigma^2)/k}{(\mathrm{SSE}/\sigma^2)/[n-(k+1)]} = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/[n-(k+1)]} = \frac{\mathrm{MSR}}{\mathrm{MSE}} \sim F_{k, n-(k+1)}$$


The test statistic is identical in structure to the ANOVA F statistic from Chapter 11: the numerator 
measures the variation explained by the proposed model, while the denominator measures the 
unexplained variation. The larger the value of F, the stronger the evidence that a statistically 
significant relationship exists between y and the predictors. In fact, the model utility test statistic value 
can be re-written as 




$$f = \frac{R^2}{1 - R^2} \cdot \frac{n - (k+1)}{k}$$


so the test statistic is proportional to $R^2/(1 - R^2)$, the ratio of the explained to unexplained variation. If
the proportion of explained variation is high relative to unexplained, we would naturally want to
reject $H_0$ and confirm the utility of the model. However, the factor [n - (k + 1)]/k decreases as
k increases, and if k is large relative to n, it will reduce f considerably.

Because the model utility test considers all of the explanatory variables simultaneously, it is 
sometimes called a global F test. 


Example 12.22 What impacts the salary offers made to faculty in management and information 
science (MIS)? The accompanying table shows part of the data available on a sample of 167 MIS 
faculty, which includes the following variables (based on publicly available data provided by 
Prof. Dennis Galletta, University of Pittsburgh): 


y = salary offer, in thousands of dollars
$x_1$ = year the salary offer was made, coded as years after 2000 (so 2009 is coded as 9)
$x_2$ = previous experience, in years
$x_3$ = teaching load, converted into the number of three-credit semester courses per year.


Observation Salary (y) Year (x1) Experience (x2) Teaching load (x3)
1 90.0 3 5 3 
2 91.5 3 12 4 
3 105.0 4 7 4 
4 79.2 6 3 =) 
5 9 0.5 6 


The model utility test hypotheses are 


$H_0$: $\beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: at least one of these three $\beta_j$'s is not 0


Figure 12.28 shows output from the JMP statistical package. The values of $s_e$ (Root Mean Square
Error), $R^2$, and adjusted $R^2$ certainly suggest a useful model. The value of the model utility F ratio is

$$f = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/[n-(k+1)]} = \frac{20064.099/3}{21418.105/163} = \frac{6688.03}{131.40} = 50.8985$$

This value also appears in the F ratio column of the ANOVA table in Figure 12.28. Since
$f = 50.8985 > F_{.01,3,163} \approx 3.9$, $H_0$ should be rejected at significance level .01. In fact, the ANOVA
table in the JMP output shows that P-value < .0001. The null hypothesis should therefore be rejected
at any reasonable significance level. We conclude that there is a useful linear relationship between 
y and at least one of the three predictors in the model. Note this does not mean that all three 
predictors are necessarily useful; we will say more about this shortly.

Summary of Fit

RSquare                      0.48368
RSquare Adj                  0.474177
Root Mean Square Error       11.46296
Mean of Response             95.42565
Observations (or Sum Wgts)   167

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       3        20064.099       6688.03   50.8985
Error     163        21418.105        131.40   Prob > F
C. Total  166        41482.204                 <.0001*

Parameter Estimates

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    115.55831   3.969873     29.11    <.0001*
Year         2.1853404   0.496914      4.40    <.0001*
Experience   0.2068286   0.232131      0.89    0.3742
TeachLoad    -6.580262   0.564235    -11.66    <.0001*

Figure 12.28 Multiple regression output from JMP for the data of Example 12.22
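
The summary quantities in this output can be reproduced from the sums of squares alone. The short Python sketch below is only an illustration: the function names are hypothetical, and the inputs are the Model and Error sums of squares quoted from the JMP output. It also checks that the ANOVA form of the F ratio agrees with the form written in terms of the coefficient of determination.

```python
def f_from_anova(ssr, sse, n, k):
    """Model utility F ratio, MSR / MSE."""
    return (ssr / k) / (sse / (n - (k + 1)))

def f_from_r2(r2, n, k):
    """Same F ratio written in terms of R^2."""
    return r2 / (1 - r2) * (n - (k + 1)) / k

n, k = 167, 3                       # sample size and number of predictors
ssr, sse = 20064.099, 21418.105     # Model and Error sums of squares from the output
sst = ssr + sse                     # ANOVA identity: SST = SSE + SSR

r2 = ssr / sst
adj_r2 = 1 - (n - 1) / (n - (k + 1)) * sse / sst
s_e = (sse / (n - (k + 1))) ** 0.5  # residual standard deviation
f_anova = f_from_anova(ssr, sse, n, k)
f_r2 = f_from_r2(r2, n, k)

print(f"R^2 = {r2:.5f}, adjusted R^2 = {adj_r2:.6f}, s_e = {s_e:.5f}")
print(f"f (ANOVA form) = {f_anova:.4f}, f (R^2 form) = {f_r2:.4f}")
```

The P-value is not computed here because it requires the F distribution function, which is not in the Python standard library; a statistics package would supply it.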


Inferences about Individual Regression Coefficients 

When the assumptions for the multiple linear regression model are met, we may also construct CIs
and perform hypothesis tests for the individual population regression coefficients $\beta_1, \ldots, \beta_k$.
Inferences concerning a single coefficient $\beta_j$ are based on the standardized variable

$$T = \frac{\hat{\beta}_j - \beta_j}{S_{\hat{\beta}_j}}$$

which, assuming the model is correct, has a t distribution with n - (k + 1) df. A matrix formula for
$s_{\hat{\beta}_j}$ is given in Section 12.9, and the result is part of the output from all standard regression computer
packages. A CI for $\beta_j$ allows us to estimate with confidence the effect of the predictor $x_j$ on the
response variable, while adjusting for the effects of the other explanatory variables in the model.

By far the most commonly tested hypotheses about an individual $\beta_j$ are $H_0$: $\beta_j = 0$ versus
$H_a$: $\beta_j \ne 0$, in which case the test statistic value simplifies to $t = \hat{\beta}_j / s_{\hat{\beta}_j}$. This null hypothesis states
that, with the other explanatory variables in the model, the variable $x_j$ does not provide any additional
useful information about y. This is referred to as a variable utility test. It is sometimes the case that a
predictor variable will be judged useful for predicting y under the simple linear regression model
using $x_j$ alone but not in the multiple regression setting (i.e., in the presence of other predictors). This
usually indicates that the other variables do a better job predicting y, and that the additional
information in $x_j$ is effectively “redundant.” Occasionally, the opposite will happen: a predictor that
isn’t very useful by itself proves statistically significant in the presence of some other variables.

Next, let $\mathbf{x}^* = (x_1^*, \ldots, x_k^*)$ denote a particular value of $\mathbf{x} = (x_1, \ldots, x_k)$. Then the point estimate of
$\mu_{Y|\mathbf{x}^*}$, the expected value of Y when $\mathbf{x} = \mathbf{x}^*$, is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^*$. The estimated standard
deviation $s_{\hat{Y}}$ of the corresponding estimator is a complicated expression involving the sample $x_{ij}$'s, but
a matrix formula is given in Section 12.9. The better statistical software packages will calculate it on
request. Inferences about $\mu_{Y|\mathbf{x}^*}$ are based on standardizing its estimator to obtain a t variable having
n - (k + 1) df.


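1. A $100(1 - \alpha)\%$ CI for $\beta_j$, the coefficient of $x_j$ in the model equation (12.12), is

$$\hat{\beta}_j \pm t_{\alpha/2, n-(k+1)} \cdot s_{\hat{\beta}_j}$$

2. A test for $H_0$: $\beta_j = \beta_{j0}$ uses the test statistic value $t = (\hat{\beta}_j - \beta_{j0})/s_{\hat{\beta}_j}$, based on n - (k + 1)
df. The test is upper-, lower-, or two-tailed according to whether $H_a$ contains the inequality
$>$, $<$, or $\ne$.

3. A $100(1 - \alpha)\%$ CI for $\mu_{Y|\mathbf{x}^*}$ is

$$\hat{y} \pm t_{\alpha/2, n-(k+1)} \cdot s_{\hat{Y}}$$

4. A $100(1 - \alpha)\%$ PI for a future y value is

$$\hat{y} \pm t_{\alpha/2, n-(k+1)} \cdot \sqrt{s_e^2 + s_{\hat{Y}}^2}$$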

Simultaneous intervals for which the joint confidence or prediction level is controlled can be obtained 
by applying the Bonferroni technique discussed in Section 12.4. 


Example 12.23 (Example 12.22 continued) The JMP output in Figure 12.28 includes variable
utility tests and 95% confidence intervals for the coefficients. The results of testing $H_0$: $\beta_2 = 0$ versus
$H_a$: $\beta_2 \ne 0$ ($x_2$ = years of experience) are

$$\hat{\beta}_2 = .2068, \quad s_{\hat{\beta}_2} = .2321, \quad t = .2068/.2321 = 0.89, \quad P\text{-value} = .3742$$

so $H_0$ is not rejected here. Adjusting for the year a salary offer was made and the position’s teaching
load, years of prior experience does not provide additional useful information about MIS faculty
members’ salary offers. The other two variables are useful (P-value < .0001 for each). A 95% CI for
$\beta_3$ is

$$\hat{\beta}_3 \pm t_{.025, 167-(3+1)} \cdot s_{\hat{\beta}_3} = -6.58 \pm 1.975(.5642) = (-7.694, -5.466),$$

which agrees with the interval given in Figure 12.28. Thus after adjusting for offer year and prior
experience, a one-course increase in teaching load is associated with a decrease in expected salary
offer between 5466 and 7694 dollars. (If that seems counterintuitive, bear in mind that elite institutions can
offer both higher salaries and lighter teaching loads, while the opposite is true at a typical state
university.)

The predicted salary offer in 2015 ($x_1 = 15$) for a newly minted PhD ($x_2 = 0$, no experience) and a
five-course annual teaching load ($x_3 = 5$) is

$$\hat{y} = 115.558 + 2.18534(15) + .2068(0) - 6.580(5) = 115.437$$

(that is, 115,437 dollars). The estimated standard deviation for this predicted value can be obtained from
software: $s_{\hat{Y}} = 4.929$. So a 95% confidence interval for the mean offer under these settings is

$$\hat{y} \pm t_{.025,163} \cdot s_{\hat{Y}} = 115.437 \pm 1.975(4.929) = (105.704, 125.171)$$

which can also be obtained from software. A 95% prediction interval for a single future salary offer
under these settings is

$$\hat{y} \pm t_{.025,163} \cdot \sqrt{s_e^2 + s_{\hat{Y}}^2} = 115.437 \pm 1.975\sqrt{11.463^2 + 4.929^2} = (90.798, 140.076)$$

Of course, the PI is much wider than the corresponding CI.

Since $x_2$ (years of prior experience) was deemed not useful, the model could also be re-fit without
that variable, resulting in somewhat different estimated regression coefficients and, consequently,
slightly different intervals above.
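
As a check on the arithmetic in this example, the following Python sketch recomputes the t ratio, the CI for the teaching-load coefficient, and the CI and PI at the chosen settings. The estimates and standard errors are copied from the JMP output; the standard deviation of the fitted value (4.929) and the critical value 1.975 are taken from the example as given; the variable names are our own.

```python
import math

# Estimates and standard errors copied from the JMP output (Figure 12.28).
b0, b1, b2, b3 = 115.55831, 2.1853404, 0.2068286, -6.580262
se_b2, se_b3 = 0.232131, 0.564235
s_e = 11.46296      # residual standard deviation (Root Mean Square Error)
s_yhat = 4.929      # estimated SD of the fitted value, as quoted in the example
t_crit = 1.975      # t critical value for 163 df, as quoted in the example

# Variable utility test for experience (x2): t = estimate / standard error.
t_b2 = b2 / se_b2

# 95% CI for the teaching-load coefficient.
ci_b3 = (b3 - t_crit * se_b3, b3 + t_crit * se_b3)

# Point prediction at x1 = 15 (year 2015), x2 = 0, x3 = 5.
x1, x2, x3 = 15, 0, 5
y_hat = b0 + b1 * x1 + b2 * x2 + b3 * x3

# 95% CI for the mean response and 95% PI for a single future response.
half_ci = t_crit * s_yhat
ci_mean = (y_hat - half_ci, y_hat + half_ci)
half_pi = t_crit * math.sqrt(s_e**2 + s_yhat**2)
pred_int = (y_hat - half_pi, y_hat + half_pi)

print(f"t for experience: {t_b2:.2f}")
print(f"CI for teaching-load coefficient: ({ci_b3[0]:.3f}, {ci_b3[1]:.3f})")
print(f"y_hat: {y_hat:.3f}")
print(f"CI for mean offer: ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
print(f"PI for one offer: ({pred_int[0]:.3f}, {pred_int[1]:.3f})")
```

Differences of a few thousandths from the intervals quoted in the example come from rounding of the inputs, in particular the critical value.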


Assessing Model Adequacy 

The model assumptions of linearity, normality, constant variance, and independence are essentially 
the same for simple and multiple regression. Scatterplots of y versus each of the explanatory variables 
can give a preliminary sense of whether linearity is plausible, but the residual plots detailed in 
Section 12.6 are preferred. The standardized residuals in multiple regression result from dividing each 
residual e; by its estimated standard deviation; a matrix formula for the standard deviation of e; is 
given in Section 12.9. We recommend a normal probability plot of the standardized residuals as a 
basis for validating the normality assumption. Plots of the standardized residuals versus each
predictor and/or versus $\hat{y}$ should show no discernible pattern. If the linearity and/or constant variance
conditions appear violated, transformation of the response variable (possibly in tandem with trans- 
forming some x;’s) may be required. The book by Kutner et al. discusses transformations as well as 
other diagnostic plots. 
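
One common way to compute standardized residuals uses the diagonal entries $h_{ii}$ of the hat matrix, dividing each residual by $s_e\sqrt{1 - h_{ii}}$. The sketch below assumes that this is the matrix formula being referred to; it uses NumPy, runs on synthetic data, and the function name is hypothetical.

```python
import numpy as np

def standardized_residuals(X, y):
    """Return (standardized residuals, leverages) for a least squares fit.

    X holds the k predictor columns; an intercept column is added here.
    Each residual e_i is divided by s_e * sqrt(1 - h_ii), where h_ii is
    the i-th diagonal entry of the hat matrix.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    hat = design @ np.linalg.pinv(design)       # projection onto the column space
    resid = y - hat @ y
    s_e = np.sqrt(resid @ resid / (n - (k + 1)))
    leverage = np.diag(hat)
    return resid / (s_e * np.sqrt(1 - leverage)), leverage

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

std_resid, leverage = standardized_residuals(X, y)
print(np.round(std_resid[:5], 3))
print("sum of leverages:", round(float(leverage.sum()), 6))
```

The standardized residuals returned here are what would be passed to a normal probability plot or plotted against the fitted values and each predictor.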


Example 12.24 (Example 12.23 continued) Figure 12.29 shows a normal probability plot of the
standardized residuals, as well as a plot of the standardized residuals versus the fitted values $\hat{y}_i$, for the
MIS salary data. The probability plot is sufficiently straight that there is no reason to doubt the
assumption of normally distributed errors. The residual-vs-fit plot shows no pattern, validating the
linearity and constant variance assumptions.

Plots of the standardized residuals against the explanatory variables $x_1$, $x_2$, and $x_3$ (not shown) also
exhibit no discernible pattern.

Figure 12.29 Residual plots for the MIS salary data


Exercises: Section 12.7 (77-87) 


77. Let y = weekly sales at a fast-food outlet (in dollars), $x_1$ = number of competing outlets within a
1-mile radius, and $x_2$ = population within a 1-mile radius (in thousands of people). Suppose that the
true regression model is

$$Y = 10000 - 1400x_1 + 2100x_2 + \varepsilon$$

a. Determine expected sales when the number of competing outlets is 2 and there are 8000
people within a 1-mile radius.

b. Determine expected sales for an outlet that has three competing outlets and 5000 people
within a 1-mile radius.

c. Interpret $\beta_1$ and $\beta_2$.

d. Interpret $\beta_0$. In what context does this value make sense?

78. Cardiorespiratory fitness is widely recognized as a major component of overall physical
well-being. Direct measurement of maximal oxygen uptake (VO$_2$max) is the single best measure of
such fitness, but direct measurement is time-consuming and expensive. It is therefore desirable to
have a prediction equation for VO$_2$max in terms of easily obtained quantities. Consider the
variables

y = VO$_2$max (L/min)
$x_1$ = weight (kg)
$x_2$ = age (yr)
$x_3$ = time necessary to walk 1 mile (min)
$x_4$ = heart rate at the end of the walk (beats/min)

Here is one possible model, for male students, consistent with the information given in the article
“Validation of the Rockport Fitness Walking Test in College Males and Females” (Res. Q. Exerc. Sport
1994: 152-158):

$$Y = 5.0 + .01x_1 - .05x_2 - .13x_3 - .01x_4 + \varepsilon \qquad \sigma = .4$$

a. Interpret $\beta_1$ and $\beta_3$.

b. What is the expected value of VO$_2$max when weight is 76 kg, age is 20 years, walk time is
12 min, and heart rate is 140 beats/min?

c. What is the probability that VO$_2$max will be between 1.00 and 2.60 for a single observation
made when the values of the predictors are as stated in part (b)?

79. Athletic footwear is a multibillion-dollar industry, and manufacturers need to understand which
features are most important to customers. The article “Overall Preference of Running Shoes Can Be
Predicted by Suitable Perception Factors Using a Multiple Regression Model” (Human Factors 2017:
432-441) reports a survey of 100 young male runners in Beijing and Singapore. Each participant was
asked to assess the Li Ning Hyper Arc (a running shoe) on five features: y = overall preference,
$x_1$ = fit, $x_2$ = cushioning, $x_3$ = arch support, and $x_4$ = stability. All measurements were made on a
0-15 visual analog scale, with 0 = dislike extremely and 15 = like extremely.

a. The estimated regression equation reported in the article is
$\hat{y} = -.66 + .35x_1 + .34x_2 + .09x_3 + .32x_4$. Interpret the coefficient on $x_2$. [Note: The units are
simply “points.”]

b. Estimate the true mean rating from runners whose ratings on fit, cushioning, arch support, and
stability are 9.0, 8.7, 8.9, and 9.2, respectively. (These were the average ratings across all 100
participants.) What would be more informative than this point estimate?

c. The authors report $R^2 = .777$ for this four-variable model. Perform a model utility test at the
.01 significance level. Can we conclude that all four predictors provide useful information?

d. The article also reports variable utility test statistic values for each predictor; in order, they
are t = 6.23, 4.92, 1.35, and 5.51. Perform all four variable utility tests at a simultaneous .01 level.
Are all four predictors considered useful?


80. Roads in Egypt are notoriously haphazard, often lacking proper engineering consideration. The
article “Effect of Speed Hump Characteristics on Pavement Condition” (J. Traffic Transp. Engr. 2017:
103-110) reports a study by two Egyptian civil engineering faculty of 52 speed bumps on a major road
in upper Egypt. They measured each speed bump’s height ($x_1$), width ($x_2$), and distance ($x_3$) from the
preceding speed bump (all in meters). They also evaluated each bump using a pavement condition
index (PCI), a 0-100 scale with 100 the best condition.

a. With y = PCI, the estimated regression equation reported in the article is
$\hat{y} = 59.05 - 243.668x_1 + 11.675x_2 + .012x_3$. Interpret the coefficient on $x_1$. [Hint: Does a
one-meter increase make sense?]

b. Estimate the pavement condition index of a speed bump 0.13 m tall, 2.5 m wide, and 666.7 m
from the preceding speed bump.

c. The authors report $R^2 = .770$ for this three-variable model. Interpret this value, and then carry
out the model utility test at the .05 significance level.

81. The accompanying data on sale price (thousands of dollars), size (thousands of sq. ft.), and
land-to-building ratio for 10 large industrial properties appeared in the paper “Using Multiple
Regression Analysis in Real Estate Appraisal” (Appraisal J. 2001: 424-430).

Price    Size   L/B ratio      Price    Size   L/B ratio
10600    2167     2.011         8000    2867     2.279
 2625     752     3.543        10000    1698     3.123
10500    2423     3.632         6670    1046     4.771
 1850     225     4.653         5825    1109     7.569
20000    3918     1.712         4517     405    17.190

a. Use software to create an estimated multiple regression equation for predicting the sale price
of a property from its size and land-to-building ratio.

b. Interpret the estimated regression coefficients in this example.

c. Based on the data, what is the predicted sale price for a 500,000 ft$^2$ industrial property with a
land-to-building ratio of 4.0?

82. There has been a shift recently away from using harsh chemicals to dye textiles in favor of
natural plant extracts. The article “Ecofriendly Dyeing of Silk with Extract of Yerba Mate” (Textile
Res. J. 2017: 829-837) describes an experiment to study the effects of dye concentration (mg/L),
temperature (°C), and pH on dye adsorption (mg of dye per gram of fabric). [The article also included
pictures of the resulting color from each treatment combination; dye adsorption is a proxy for color
here.]

Conc.   Temp.   pH    Adsorption
10      70      3.0   250
20      70      3.0   520
10      90      3.0   387
20      90      3.0   593
10      70      4.0   157
20      70      4.0   377
10      90      4.0   225
20      90      4.0   451
15      80      3.5   353
15      80      3.5   382
15      80      3.5   373

a. Obtain the estimated regression equation for this data. Then, interpret the coefficient on
temperature.

b. Calculate a point estimate for mean dye adsorption when concentration = 15 mg/L,
temperature = 80 °C, and pH = 3.5 (i.e., the settings of the last three experimental runs).

c. The model utility test results in a test statistic value of f = 44.02. What can be concluded at
the $\alpha = .01$ level?

d. Calculate and interpret a 95% CI for the settings specified in part (b).

e. Calculate and interpret a 95% PI for the settings specified in part (b).

f. Perform variable utility tests on each of the predictors. Can each one be judged useful
provided that the other two are included in the model?
83. Carbon nanotubes (CNTs) are used for everything from structural reinforcement to
communication antennas. The article “Fast Mechanochemical Synthesis of Carbon Nanotube-Polyanaline
Hybrid Materials” (J. Mater. Res. 2018: 1486-1495) reported the following data on y = electrical
conductivity (S/cm), $x_1$ = multi-walled CNT weight (mg), $x_2$ = CN$_x$ weight (mg), and $x_3$ = water
volume (ml) from one experiment.

y       x1     x2     x3
...     ...    ...    ...
...     ...    ...    ...
4.512   40.6   20.5   5
4.153   20.3   40.4   2
1.727   20.1   20.1   7
2.415    0.0   40.4   7
3.008   20.3   20.1   7
3.869   20.3   20.3   7
2.025   20.4   20.2   7
...     ...    ...    ...
2.545   20.4   20.1   7
4.426   20.3    0.0   ...
2.863   40.4   20.3   12
2.096    0.0   20.4   12
...     20.8   40.6   2

a. Obtain and interpret $R^2$ and $s_e$ for the model with predictors $x_1$, $x_2$, and $x_3$.

b. Test for model utility using $\alpha = .05$.

c. Does the data support the manufacturing goal of (relatively) consistent electrical conductivity
across differing values of the experimental factors? Explain.

84. Electric vehicles have greatly increased in popularity recently, but their short battery life (with a
few exceptions) continues to be of concern. The article “Design of Robust Battery Capacity Model for
Electric Vehicle by Incorporation of Uncertainties” (Int. J. Energy Res. 2017: 1436-1451) includes
the following data on temperature (°C), discharge rate, and battery capacity (A-h, ampere-hours) for
a certain type of lithium-ion battery.

Temp   Disch. rate   Capacity      Temp   Disch. rate   Capacity
 0     0.50          0.96001        0     1.25          1.06506
 0     1.75          0.85001       20     1.25          1.34459
...    ...           ...           ...    ...           ...
25     0.50          1.38001       30     1.25          1.32355
40     1.75          1.05396       40     1.25          1.54713
40     3.00          0.96337       50     1.25          1.47159
20     0.50          1.27100        0     1.50          0.85171
20     1.75          1.24897       20     1.50          1.20890
20     3.00          1.20751       25     1.50          1.29703
45     3.00          1.32001       30     1.50          1.16097
50     3.00          1.39139       40     1.50          1.32047
40     0.50          1.00208       50     1.50          1.36305
50     0.50          1.44001       30     1.25          1.32355
25     1.75          1.30001       40     1.25          1.54713
30     1.75          0.82697       50     1.25          1.47159
50     1.75          1.42713        0     1.50          0.85171
 0     1.00          1.08485       20     1.50          1.20890
20     1.00          1.40556       25     1.50          1.29703
25     1.00          1.47986       30     1.50          1.16097
30     1.00          0.91734       40     1.50          1.32047
40     1.00          1.18187       50     1.50          1.36305

a. Determine the estimated regression equation based on this data (y = capacity).

b. Calculate a point estimate for the expected capacity of a lithium-ion battery of this type when
operated at 30 °C with discharge rate 1.00 (meaning the battery should drain in 1 h).

c. Perform a model utility test at the .05 significance level.

d. Calculate a 95% CI for expected capacity at the settings in part (b).

e. Calculate a 95% PI for the capacity of a single such battery at the settings in part (b).

f. Perform variable utility tests on both predictors at a simultaneous .05 significance level. Are
both temperature and discharge rate useful predictors of capacity?

85. Steel microfibers are an alternative to conventional rebar reinforcement of concrete structures.
The article “Measurement of Average Tensile Force for Individual Steel Fiber Using New Direct
Tension Test” (J. Test. Eval. 2016: 2403-2413) proposes a new evaluation method for concrete
specimens infused with twisted steel micro rebar (TSMR) fibers, which were loaded until a crack
occurred. The accompanying data on y = load until cracking (lb), $x_1$ = diameter at break (in),
$x_2$ = number of TSMR fibers per in$^2$, and $x_3$ = concrete compressive strength (psi) appears in the
article.

y      x1       x2    x3        y      x1      x2    x3
213    2.778    1.2   7020      688    2.735   3.4   6130
706    2.725    3.3   7020      180    2.705   2.6   6130
440    2.683    3.4   7020      1046   2.755   7.7   6130
984    2.835    8.2   5680      272    2.734   3.2   6880
155    2.7799   2.3   5680      1168   2.725   7.4   7270
251    2.712    2.3   5680      821    2.750   6.1   7270
311    2.712    1.6   7960      418    2.740   4.4   7270
989    2.686    6.0   7960      1102   2.708   9.2   7390
326    2.821    4.0   7960      56     2.860   1.6   2950
479    2.810    2.7   7090      91     3.002   1.3   2950
324    2.845    1.9   7090      97     2.749   1.9   2950
404    2.755    3.0   7090

a. Determine the estimated regression equation for this data.

b. Perform variable utility tests on each of the three explanatory variables. Can each one be
judged useful given that the other two are included in the model?

c. Calculate $R^2$ and adjusted $R^2$ for this three-predictor model.

d. Perform a multiple regression of y on just $x_2$ and $x_3$. Determine both $R^2$ and adjusted $R^2$ for
this reduced model. How do they compare to the values in part (c)? Explain.
86. An investigation of a die casting process resulted in the accompanying data on $x_1$ = furnace
temperature, $x_2$ = die close time, and y = temperature difference on the die surface (“A
Multiple-Objective Decision-Making Approach for Assessing Simultaneous Improvement in Die Life and
Casting Quality in a Die Casting Process,” Qual. Engr. 1994: 371-383).

x1   1250   1300   1350   1250   1300   1250   1300   1350   1350
x2      6      7      6      7      6      8      8      7      8
y      80     95    101     85     92     87     96    106    108

Minitab output from fitting the multiple regression model with predictors $x_1$ and $x_2$ is given here.

The regression equation is
tempdiff = -200 + 0.210 furntemp + 3.00 clostime

Predictor   Coef       Stdev      t-ratio   p
Constant    -199.56    11.64      -17.14    0.000
furntemp    0.210000   0.008642    24.30    0.000
clostime    3.0000     0.4321       6.94    0.000

s = 1.058   R-sq = 99.1%   R-sq(adj) = 98.8%

Analysis of variance

Source       DF   SS       MS       F        P
Regression    2   715.50   357.75   319.31   0.000
Error         6     6.72     1.12
Total         8   722.22

a. Carry out the model utility test.

b. Calculate and interpret a 95% confidence interval for $\beta_2$, the population regression
coefficient of $x_2$.

c. When $x_1 = 1300$ and $x_2 = 7$, the estimated standard deviation of $\hat{Y}$ is $s_{\hat{Y}} = .353$. Calculate a
95% confidence interval for true average temperature difference when furnace temperature is 1300
and die close time is 7.

d. Calculate a 95% prediction interval for the temperature difference resulting from a single
experimental run with a furnace temperature of 1300 and a die close time of 7.

e. Use appropriate diagnostic plots to see if there is any reason to question the regression model
assumptions.


87. The article “Analysis of the Modeling Methodologies for Predicting the Strength of Air-Jet Spun
Yarns” (Textile Res. J. 1997: 39-44) reported on a study carried out to relate yarn tenacity (y, in g/tex)
to yarn count ($x_1$, in tex), percentage polyester ($x_2$), first nozzle pressure ($x_3$, in kg/cm$^2$), and
second nozzle pressure ($x_4$, in kg/cm$^2$). The estimate of the constant term in the corresponding
multiple regression equation was 6.121. The estimated coefficients for the four predictors were
-.082, .113, .256, and -.219, respectively, and the coefficient of multiple determination was .946.
Assume that n = 25.

a. State and test the appropriate hypotheses to decide whether the fitted model specifies a useful
linear relationship between the response variable and at least one of the four model predictors.

b. Calculate the value of adjusted $R^2$ and comment.

c. Calculate a 99% confidence interval for true mean yarn tenacity when yarn count is 16.5, yarn
contains 50% polyester, first nozzle pressure is 3, and second nozzle pressure is 5 if the estimated
standard deviation of predicted tenacity under these circumstances is .350.


12.8 Quadratic, Interaction, and Indicator Terms 


The fit of a multiple regression model can often be improved by creating new predictors from the 
original explanatory variables. In this section we discuss the two primary examples: quadratic terms 
and interaction terms. We also explain how to incorporate categorical predictor variables into the 
multiple regression model through the use of indicator variables. 


Polynomial Regression 

Let’s return for a moment to the case of bivariate data consisting of n (x, y) pairs. Suppose that a 
scatterplot shows a parabolic rather than linear shape. Then it is natural to specify a quadratic 
regression model: 


$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$


The corresponding population regression function $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2$ gives the mean or
expected value of Y for any particular x.


What does this have to do with multiple regression? Re-write the quadratic model equation as 
follows: 


$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \quad \text{where } x_1 = x \text{ and } x_2 = x^2$$


Now this looks exactly like a multiple regression equation with two predictors. Although we interpret 
this model as a quadratic function of x, the multiple linear regression model (12.12) only requires that 
the response be a linear function of the f,’s and ¢. Nothing precludes one predictor being a math- 
ematical function of another one. So, from a modeling perspective, quadratic regression is a special 
case of multiple regression. Thus any software package capable of carrying out a multiple regression 
analysis can fit the quadratic regression model. The same is true of cubic regression and even higher- 
order polynomial models, although in practice very rarely are such higher-order predictors needed. 

The coefficient $\beta_1$ on the linear predictor $x_1$ cannot be interpreted as the change in expected
Y when $x_1$ increases by one unit while $x_2$ is held fixed. This is because it is impossible to increase
x without also increasing $x^2$. A similar comment applies to $\beta_2$. More generally, the interpretation of
regression coefficients requires extra care when some predictor variables are mathematical functions
of others.
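
The point that quadratic regression is just multiple regression with predictors $x_1 = x$ and $x_2 = x^2$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic and noise-free, the coefficient values in `true_beta` are made up, and NumPy's general least squares routine stands in for whatever regression software is actually used.

```python
import numpy as np

# Synthetic, noise-free data from a known parabola (illustrative values only).
true_beta = np.array([2.0, 0.5, -0.01])       # beta_0, beta_1, beta_2
x = np.linspace(0.0, 50.0, 11)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2

# Quadratic regression as multiple regression: predictors x_1 = x and x_2 = x^2,
# plus a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares estimates minimize the residual sum of squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))
```

Because the synthetic responses lie exactly on a parabola, the fitted coefficients recover `true_beta` up to floating-point error; with real data the same design matrix is used, and the estimates are accompanied by the usual standard errors.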
Example 12.25 Reconsider the solar cell data of Example 12.2. Figure 12.2 clearly shows a
parabolic relationship between x = sheet resistance and y = cell efficiency. To calculate the “least
squares parabola” for this data, software fits a multiple regression model with two predictors:
$x_1 = x$ and $x_2 = x^2$. The first few rows of data for this scenario are as follows:


    y        $x_1 = x$    $x_2 = x^2$

  13.91      43.58       1898.78
  13.50      50.94       2594.63
  13.59      60.03       3603.60
  13.86      66.82       4464.91


(In most software packages, it is not necessary to calculate $x^2$ for each observation; rather, the user can
merely instruct the software to fit a quadratic model.) The coefficients that minimize the residual sum of
squares are $\hat{\beta}_0 = 4.008$, $\hat{\beta}_1 = .3617$, and $\hat{\beta}_2 = -.003344$, so the estimated regression equation is

$$\hat{y} = 4.008 + .3617x_1 - .003344x_2 = 4.008 + .3617x - .003344x^2$$


Figure 12.30 shows this parabola superimposed on a scatterplot of the original (x, y) data. Notice that
the negative coefficient on $x^2$ matches the concave-downward contour of the data.


Figure 12.30 Scatterplot for Example 12.25 with a best-fit parabola (efficiency versus sheet resistance in ohms)

The estimated equation can now be used to make estimates and predictions at any particular
x value. For example, the predicted efficiency at x = 60 ohms is determined by substituting $x_1 = x = 60$
and $x_2 = x^2 = 60^2 = 3600$:

$$\hat{y} = 4.008 + .3617(60) - .003344(60)^2 = 13.67 \text{ percent}$$

Using software, a 95% CI for the mean efficiency of all 60-ohm solar panels is (13.50, 13.84), while a
95% PI for the efficiency of a single future 60-ohm panel is (12.32, 15.03). As always, the prediction
interval is substantially wider than the confidence interval. ■
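
The point prediction in Example 12.25 can be checked with a short Python sketch. The coefficients are the estimates reported in the example; the function name `predict_efficiency` and the vertex calculation are additions for illustration and are not part of the original analysis.

```python
# Estimated coefficients reported in Example 12.25.
b0, b1, b2 = 4.008, 0.3617, -0.003344

def predict_efficiency(x):
    """Evaluate the fitted parabola at sheet resistance x (ohms)."""
    return b0 + b1 * x + b2 * x**2

y_hat_60 = predict_efficiency(60)
print(round(y_hat_60, 2))          # 13.67, matching the example

# Because b2 < 0, the fitted parabola has a maximum at its vertex, x = -b1 / (2 * b2).
x_peak = -b1 / (2 * b2)
print(round(x_peak, 1))
```

The vertex lands at roughly 54 ohms, inside the range of the data. It is a feature of the fitted curve only, so it should be read as an estimate rather than an exact optimum.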
Models with Interaction

Suppose that an industrial chemist is interested in the relationship between product yield (y) from a
certain reaction and two explanatory variables, $x_1$ = reaction temperature and $x_2$ = pressure at which
the reaction is carried out. The chemist initially proposes the relationship


$$Y = 1200 + 15x_1 - 35x_2 + \varepsilon$$


for temperature values between 80 and 100 in combination with pressure values ranging from 50 to
70. The population regression function $1200 + 15x_1 - 35x_2$ gives the mean y value for any particular
values of the predictors. Consider this mean y value for three different particular temperatures:


$x_1 = 90$: mean y value $= 1200 + 15(90) - 35x_2 = 2550 - 35x_2$
$x_1 = 95$: mean y value $= 2625 - 35x_2$
$x_1 = 100$: mean y value $= 2700 - 35x_2$


Graphs of these three mean y value functions are shown in Figure 12.31a. Each graph is a straight
line, and the three lines are parallel, each with a slope of $-35$. Thus irrespective of the fixed value of
temperature, the change in mean yield associated with a one-unit increase in pressure is $-35$.


Figure 12.31 Graphs of the mean y value for two different models: (a) $1200 + 15x_1 - 35x_2$;
(b) $-4500 + 75x_1 + 60x_2 - x_1 x_2$


In reality, when pressure increases the decline in average yield should be more rapid for a high
temperature than for a low temperature, so the chemist has reason to doubt the appropriateness of the
proposed model. Rather than the lines being parallel, the line for a temperature of 100 should be
steeper than the line for a temperature of 95, and that line in turn should be steeper than the line for
$x_1 = 90$. A model that has this property includes, in addition to $x_1$ and $x_2$, a third predictor variable,
$x_3 = x_1 x_2$. One such model is


$$Y = -4500 + 75x_1 + 60x_2 - x_1 x_2 + \varepsilon$$


for which the population regression function is $-4500 + 75x_1 + 60x_2 - x_1 x_2$. This gives

(mean y value when temperature is 100) $= -4500 + (75)(100) + 60x_2 - 100x_2 = 3000 - 40x_2$
(mean y value when temperature is 95) $= 2625 - 35x_2$
(mean y value when temperature is 90) $= 2250 - 30x_2$


These are graphed in Figure 12.31b. Now each different value of $x_1$ yields a line with a different
slope, so the change in expected yield associated with a 1-unit increase in $x_2$ depends on the value of
$x_1$. When this is the case, the two predictor variables are said to interact.


DEFINITION  If the effect on y of one explanatory variable $x_1$ depends on the value of a second
            explanatory variable $x_2$, then $x_1$ and $x_2$ have an interaction effect on the (mean)
            response.

            We can model this interaction by including as an additional predictor $x_3 = x_1 x_2$,
            the product of the two explanatory variables, known as an interaction term.
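
The contrast between the chemist's two models can be made concrete with a small Python sketch. The two mean functions are the ones given in the text; the helper `pressure_slope` and the choice of pressure 60 as the evaluation point are assumptions made for illustration.

```python
def mean_yield_additive(temp, pressure):
    """Mean yield under the no-interaction model."""
    return 1200 + 15 * temp - 35 * pressure

def mean_yield_interaction(temp, pressure):
    """Mean yield under the model that includes the product term."""
    return -4500 + 75 * temp + 60 * pressure - temp * pressure

def pressure_slope(mean_fn, temp, pressure=60):
    """Change in mean yield for a one-unit pressure increase at a fixed temperature."""
    return mean_fn(temp, pressure + 1) - mean_fn(temp, pressure)

for temp in (90, 95, 100):
    print(temp,
          pressure_slope(mean_yield_additive, temp),
          pressure_slope(mean_yield_interaction, temp))
```

Under the additive model the slope is $-35$ at every temperature, whereas under the interaction model it is $60 - x_1$, giving $-30$, $-35$, and $-40$ at temperatures 90, 95, and 100.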


The general equation for a multiple regression model based on two explanatory variables $x_1$ and $x_2$
and also including an interaction term is


$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \quad \text{where } x_3 = x_1 x_2$$


When an interaction effect is present, this model will usually give a much better fit to the data than 
would the no-interaction model. Failure to consider a model with interaction too often leads an 
investigator to conclude incorrectly that the relationship between y and a set of explanatory variables 
is not very substantial. 

In applied work, quadratic predictors $x_1^2$ and $x_2^2$ are often also included, to model a potentially
curved relationship. This leads to the complete second-order model


$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$$


This model replaces the straight lines of Figure 12.31 with parabolas (each one is the graph of the
population regression function as $x_2$ varies when $x_1$ has a particular value).
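
As a rough sketch of how the predictors of a complete second-order model might be assembled from the original explanatory variables, consider the following Python function. The name `second_order_design` and the three rows of input are hypothetical, and the column ordering (originals, squares, then products) is a convenience that differs from the ordering of terms in the displayed model.

```python
from itertools import combinations
import numpy as np

def second_order_design(X):
    """Return predictor columns for the complete second-order model (no intercept).

    X has one column per original predictor. The result contains the original
    columns, then their squares, then all pairwise products.
    """
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    squares = [X[:, j] ** 2 for j in range(k)]
    products = [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack([X] + squares + products)

# Two predictors give 2 + 2 + 1 = 5 columns: x1, x2, x1^2, x2^2, x1*x2.
X = np.array([[90.0, 50.0], [95.0, 60.0], [100.0, 70.0]])
print(second_order_design(X).shape)   # (3, 5)
```

Most regression software builds these columns automatically from a model specification, so a helper of this kind is mainly useful for seeing how quickly the number of predictors grows.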


Example 12.26 The need to devise environmentally friendly remedies for heavy-metal contaminated
sites has become a global issue of serious concern. The article “Polyaspartate Extraction of
Cadmium Ions from Contaminated Soil” (J. Hazard. Mater. 2018: 58-68) describes one possible
cleanup method. Researchers varied five experimental factors: $x_1$ = polyaspartate (PA) concentration
(mM), $x_2$ = PA-to-soil ratio, $x_3$ = initial cadmium (Cd) concentration in soil, $x_4$ = pH, and
$x_5$ = extraction time (hours). One of the response variables of interest was y = residual Cd concentration,
based on a total of n = 47 experimental runs.
Consider fitting a “first-order” model using all five predictors. Results from software include


$$\hat{y} = -38.7 - .174x_1 - 1.308x_2 + .242x_3 + 10.26x_4 + 2.328x_5$$

$s_e = 33.124$ (df = 41), $R^2 = 71.25\%$, adjusted $R^2 = 67.74\%$


Variable utility tests indicate that all predictors except $x_1$ are useful; it’s possible that $x_1$ is redundant
with $x_2$. At the other extreme, a complete second-order model here involves 20 predictor variables:
the original $x_i$’s (five), their squares (another five), and all $\binom{5}{2} = 10$ possible interaction terms: $x_1 x_2$,
$x_1 x_3$, ..., $x_4 x_5$. Summary quantities from fitting this enormous model include $s_e = 24.878$ (df = 26),
$R^2 = 89.72\%$, and adjusted $R^2 = 81.80\%$. The reduced standard deviation and greatly increased adjusted $R^2$
both suggest that at least some of the 15 second-order terms are useful, and so it was wise to
incorporate these additional terms.
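
The bookkeeping in this example (number of predictors, error degrees of freedom, adjusted $R^2$) can be reproduced with a short Python sketch. It assumes the standard adjusted $R^2$ formula $1 - (1 - R^2)(n - 1)/[n - (k + 1)]$; the function name `adjusted_r2` is illustrative.

```python
from math import comb

def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))

n = 47
k_first = 5
k_second = 5 + 5 + comb(5, 2)          # originals + squares + interactions

print(k_second, n - (k_second + 1))    # number of predictors and error df
print(round(adjusted_r2(0.7125, n, k_first), 4))
print(round(adjusted_r2(0.8972, n, k_second), 4))
```

The results agree with the reported 67.74% and 81.80% up to the rounding of the $R^2$ values used as input.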

By considering the relative importance of the terms based on P-values, the researchers reduced 
their model to “just” 12 terms: all first-order terms, two of the quadratic terms, and five of the ten 
interactions. Based on the resulting estimated regression equation (shown in the article, but not here) 
researchers were able to determine the values of $x_1, \ldots, x_5$ that minimize residual Cd concentration.
(Optimization is one of the added benefits of quadratic terms. For example, in Figure 12.30, we can 
see there is an x value for which solar cell efficiency is maximized. A linear model has no such local 
maxima or minima.) 

It’s worth noting that while $x_1$ was not considered useful in the first-order model, several second-order
terms involving $x_1$ were significant. When fitting second-order models, it is recommended to fit
the complete model first and delete useless terms rather than building up from the simpler first-order
model; using the latter approach, important quadratic and interaction effects can be missed. ■


One issue that arises with fitting a model with an abundance of terms, as in Example 12.26, is the 
potential to commit many type I errors when performing variable utility t tests on every predictor.
Exercise 94 presents a method called the partial F test for determining whether a group of predictors 
can all be deleted while controlling the overall type I error rate. 
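
As a hedged illustration of the partial F statistic defined in Exercise 94, the sketch below compares the first-order and complete second-order models of Example 12.26. The error sums of squares are reconstructed from the reported summary values through SSE = $s_e^2 \cdot$ (error df); the function name `partial_f` is an assumption, and no P-value is computed.

```python
def partial_f(sse_reduced, sse_full, m, n, k):
    """Partial F statistic for dropping m of the k predictors in the full model."""
    numerator = (sse_reduced - sse_full) / m
    denominator = sse_full / (n - (k + 1))
    return numerator / denominator

# SSE = s_e^2 * (error df), using the summary values quoted in Example 12.26.
sse_first = 33.124**2 * 41     # first-order model, 5 predictors
sse_second = 24.878**2 * 26    # complete second-order model, 20 predictors

f_stat = partial_f(sse_first, sse_second, m=15, n=47, k=20)
print(round(f_stat, 2))
```

The statistic would then be compared with an F critical value based on m = 15 numerator df and n − (k + 1) = 26 denominator df.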


Models with Categorical Predictors 

Thus far we have explicitly considered the inclusion of only quantitative (numerical) predictor 
variables in a multiple regression model. Using simple numerical coding, categorical variables such 
as sex, type of college (private or state), or type of wood (pine, oak, or walnut) can also be incor- 
porated into a model. Let’s first focus on the case of a dichotomous variable, one with just two 
possible categories—alive or dead, US or foreign manufacture, and so on. With any such variable, we 
associate an indicator (or dummy) variable whose possible values 0 and 1 indicate which category
is relevant for any particular observation. 


Example 12.27 Is it possible to predict graduation rates from freshman test scores? Based on the
median SAT score of entering freshmen at a university, can we predict the percentage of those
freshmen who will get a degree there within six years? To investigate, let y = six-year graduation
rate, $x_2$ = median freshman SAT score, and $x_1$ = a variable defined to indicate private or public status:


$$x_1 = \begin{cases} 1 & \text{if the university is private} \\ 0 & \text{if the university is public} \end{cases}$$

The corresponding multiple regression model is

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

The mean graduation rate depends on whether the university is public or private:

mean graduation rate $= \beta_0 + \beta_2 x_2$ when $x_1 = 0$ (public)
mean graduation rate $= \beta_0 + \beta_1 + \beta_2 x_2$ when $x_1 = 1$ (private)


Thus there are two parallel lines with vertical separation $\beta_1$, as shown in Figure 12.32a. The coefficient
$\beta_1$ is the difference in mean graduation rates between private and public universities, after
adjusting for median SAT score. If $\beta_1 > 0$ then, on average, for a given SAT, private universities will
have a higher graduation rate.


Figure 12.32 Regression functions for models with one indicator variable ($x_1$) and one quantitative variable ($x_2$):
(a) no interaction; (b) interaction (each panel plots mean y against $x_2$, with separate lines for private and public universities)


A second possibility is a model with an interaction term:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

Now the mean graduation rates for the two types of university are

mean graduation rate $= \beta_0 + \beta_2 x_2$ when $x_1 = 0$ (public)
mean graduation rate $= \beta_0 + \beta_1 + (\beta_2 + \beta_3)x_2$ when $x_1 = 1$ (private)


Here we have two lines, where $\beta_1$ is the difference in intercepts and $\beta_3$ is the difference in slopes, as
shown in Figure 12.32b. Unless $\beta_3 = 0$, the lines will not be parallel and there will be an interaction
effect, meaning that the separation between public and private graduation rates depends on SAT.

To make inferences, we obtained a random sample of 20 Master’s level universities from the 2017 
data file available on www.collegeresults.org. 
University    Grad rate    Median SAT    Sector
Appalachian State University 73.4 1140 Public 
Brenau University 49.4 973 Private 
Campbellsville University 34.1 1008 Private 
Delta State University 39.6 1028 Public 
DeSales University 70.1 1072 Private 
Lasell College 54.1 966 Private 
Marshall University 49.3 1010 Public 
Medaille College 43.9 906 Private 
Mount Saint Joseph University 60.7 1011 Private 
Mount Saint Mary College 53.8 991 Private 
Muskingum University 48.2 1009 Private 
Pacific University 64.4 1122 Private 
Simpson University 56.7 985 Private 
SUNY Oneonta 70.9 1082 Public 
Texas A&M University-Texarkana 29.7 1016 Public 
Truman State University 74.9 1224 Public 
University of Redlands 77.0 1101 Private 
University of Southern Indiana 39.6 1005 Public 
University of Tennessee-Chattanooga 45.2 1088 Public 
Western State Colorado University 41.0 1026 Public 


First of all, does the interaction predictor provide useful information over and above what is contained
in $x_1$ and $x_2$? To answer this question, we should test the hypothesis $H_0$: $\beta_3 = 0$ versus $H_a$: $\beta_3 \neq 0$ first.
If $H_0$ is not rejected (meaning interaction is not informative) then we can use the parallel lines model to
see if there is a separation ($\beta_1$) between lines. Of course, it does not make sense to estimate the difference
between lines if the difference depends on $x_2$, which is the case when there is interaction.

Figure 12.33 shows R output for these two tests. The coefficient for interaction has a P-value of
roughly .42, so there is no reason to reject the null hypothesis $H_0$: $\beta_3 = 0$. Since we fail to reject the
“no-interaction” hypothesis, we drop the interaction term and re-run the analysis. The estimated
regression equation specified by R is


$$\hat{y} = -124.56039 + 13.33553x_1 + 0.16474x_2$$


The t ratio values and P-values indicate that both $H_0$: $\beta_1 = 0$ and $H_0$: $\beta_2 = 0$ should be rejected at the
.05 significance level. The coefficient on $x_1$ indicates that a private university is estimated to have a
graduation rate about 13 percentage points higher than a public university with the same median SAT.


Coefficients:
                           Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)              -152.37773    49.18039   -3.098  0.006904 **
SectorPrivate              70.06920    69.28401    1.011  0.326908
Median.SAT                  0.19077     0.04592    4.154  0.000746 ***
SectorPrivate:Median.SAT   -0.05457     0.06649   -0.821  0.423857

Testing without interaction

Coefficients:
                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -124.56039    35.29279   -3.529  0.002575 **
SectorPrivate   13.33553     4.64033    2.874  0.010530 *
Median.SAT       0.16474     0.03289    5.009  0.000108 ***


Figure 12.33 R output for an interaction model and a “parallel lines” model
At the same time, the coefficient on $x_2$ shows that, after adjusting for a university’s sector (private or
public), a 100-point increase in median SAT score is associated with roughly a 16.5 percentage point
increase in the school’s six-year graduation rate. ■
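
For readers who want to reproduce the point estimates of Example 12.27 outside R, the following Python sketch fits both models by ordinary least squares. It is a minimal sketch, assuming the 20 rows of the data table above are transcribed correctly; it yields coefficient estimates only (no standard errors or P-values), and variable names such as `X_parallel` and `X_inter` are illustrative.

```python
import numpy as np

# (graduation rate, median SAT, private indicator) for the 20 sampled universities,
# in the order listed in the data table; private = 1, public = 0.
data = np.array([
    [73.4, 1140, 0], [49.4, 973, 1], [34.1, 1008, 1], [39.6, 1028, 0],
    [70.1, 1072, 1], [54.1, 966, 1], [49.3, 1010, 0], [43.9, 906, 1],
    [60.7, 1011, 1], [53.8, 991, 1], [48.2, 1009, 1], [64.4, 1122, 1],
    [56.7, 985, 1], [70.9, 1082, 0], [29.7, 1016, 0], [74.9, 1224, 0],
    [77.0, 1101, 1], [39.6, 1005, 0], [45.2, 1088, 0], [41.0, 1026, 0],
])
y, sat, private = data[:, 0], data[:, 1], data[:, 2]
ones = np.ones_like(y)

# Parallel-lines model: intercept, x1 = private indicator, x2 = median SAT.
X_parallel = np.column_stack([ones, private, sat])
b_parallel, *_ = np.linalg.lstsq(X_parallel, y, rcond=None)

# Interaction model adds the product x1 * x2 as a fourth column.
X_inter = np.column_stack([ones, private, sat, private * sat])
b_inter, *_ = np.linalg.lstsq(X_inter, y, rcond=None)

print(np.round(b_parallel, 4))
print(np.round(b_inter, 4))
```

The printed estimates should be close to the Estimate columns of Figure 12.33. In the interaction fit, the slope for private universities is the sum of the SAT coefficient and the interaction coefficient.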


You might think that the way to handle a three-category variable is to define a single numerical
predictor with coded values such as 0, 1, and 2 corresponding to the three categories. This is
incorrect: doing so imposes an ordering on the categories that is not necessarily implied by the
context, and it forces the difference in mean response between the 0 and 1 categories to equal the
difference for categories 1 and 2 (because 1 − 0 = 2 − 1 and the model is linear in its predictors). The
correct way to incorporate three categories is to define two different indicator variables. Suppose, for
example, that y = score on a posttest taken after instruction, $x_1$ = score on an ability pretest taken
before instruction, and that there are three methods of instruction in a mathematics unit: (1) with
symbols, (2) without symbols, and (3) a hybrid method. Then let

$$x_2 = \begin{cases} 1 & \text{instruction method 1} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{instruction method 2} \\ 0 & \text{otherwise} \end{cases}$$

For an individual taught with method 1, $x_2 = 1$ and $x_3 = 0$, whereas for an individual taught with
method 2, $x_2 = 0$ and $x_3 = 1$. For an individual taught with method 3, $x_2 = x_3 = 0$, and it is not
possible that $x_2 = x_3 = 1$ because an individual cannot be taught simultaneously by both methods 1
and 2. The no-interaction model would have only the predictors $x_1$, $x_2$, and $x_3$. The following
interaction model allows the change in mean posttest score associated with a one-point increase in
pretest to depend on the method of instruction:


$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon$$


Construction of a picture like Figure 12.32 with a graph for each of the three possible ($x_2$, $x_3$) pairs
gives three nonparallel lines (unless $\beta_4 = \beta_5 = 0$).

More generally, incorporating a categorical variable with c possible categories into a multiple 
regression model requires the use of c — 1 indicator variables (e.g., five methods of instruction would 
necessitate using four indicator variables). Thus even one categorical variable can add many pre- 
dictors to a model. 
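
The c − 1 coding rule can be sketched in Python as follows. The function `indicator_columns` and the five sample method labels are hypothetical; the last listed category is treated as the baseline, mirroring the role of method 3 in the instruction example.

```python
def indicator_columns(values, categories):
    """Code a categorical variable with c categories as c - 1 indicator columns.

    The last entry of `categories` is the baseline: its rows are all zeros.
    """
    kept = categories[:-1]
    return [[1 if v == cat else 0 for cat in kept] for v in values]

methods = [1, 2, 3, 2, 1]                      # instruction method for five students
rows = indicator_columns(methods, categories=[1, 2, 3])
print(rows)   # [[1, 0], [0, 1], [0, 0], [0, 1], [1, 0]]
```

Each row holds the pair ($x_2$, $x_3$) for one student, and no row can equal [1, 1]. Which category serves as the baseline is arbitrary; changing it alters the interpretation of the coefficients but not the fitted values.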

Indicator variables can be used for categorical variables without any other predictors in the model. 
For example, consider Example 11.3, which compared the maximum power of five different ther- 
moelectric modules. Using a regression with four indicator variables (to represent the five categories) 
produces the exact same ANOVA table presented in Example 11.3. In particular, the “treatment sum 
of squares” SSTr in Chapter 11 and the “regression sum of squares” SSR of this chapter are identical, 
as are SSE, SST, and hence the results of the F test. In a sense, analysis of variance is a special case 
of multiple regression, with the only predictor variables being indicators for various categories. 
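
The equivalence between one-way ANOVA and regression on indicator variables can be checked numerically. The sketch below uses made-up responses for three hypothetical groups (not the thermoelectric data of Example 11.3) and compares the ANOVA error sum of squares with the SSE from a regression on two indicators.

```python
import numpy as np

# Hypothetical responses for three groups (made-up numbers for illustration).
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 9.0, 8.0],
    "C": [1.0, 2.0, 3.0],
}

# ANOVA error sum of squares: squared deviations from each group's own mean.
sse_anova = sum(((np.array(v) - np.mean(v)) ** 2).sum() for v in groups.values())

# Regression with c - 1 = 2 indicators (group C is the baseline).
y = np.concatenate([groups[g] for g in "ABC"])
labels = np.repeat(list("ABC"), [len(groups[g]) for g in "ABC"])
X = np.column_stack([np.ones(len(y)), labels == "A", labels == "B"]).astype(float)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_regression = ((y - X @ beta_hat) ** 2).sum()

print(sse_anova, sse_regression)
```

The two error sums of squares coincide because the fitted value for every observation is its group mean: the intercept estimates the baseline group's mean, and each indicator coefficient estimates a difference from that baseline.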

Analysis that involves both quantitative and categorical predictors, as in Example 12.27, is 
sometimes called analysis of covariance, and the quantitative variable(s) are called covariates. This 
terminology is typically applied when the effect of the categorical predictor is of primary interest, 
while the inclusion of the quantitative variables serves to reduce the amount of unexplained variation 
in the model. 
Exercises: Section 12.8 (88–98)


88. The article “Selling Prices/Sq. Ft. of Office
Buildings in Downtown Chicago — How 
Much Is It Worth to Be an Old but Class-A 
Building?” (J. Real Estate Res. 2010: 1-22) 
considered a regression model to relate 
y = ln($/ft²) to 16 predictors, including age,
age squared, number of stories, occupancy 
rate, and indicator variables for whether it is 
a class-A building, whether a building has a 
restaurant and whether it has conference 
rooms. The model was fit to data resulting 
from 203 sales. 


a. The coefficient of multiple determination
was .711. What is the value of the
adjusted coefficient of multiple determination?
Does it suggest that the relatively
high $R^2$ value was the result of
including too many predictors in the
model relative to the amount of data
available?

b. Using the $R^2$ value from (a), carry out a
test of hypotheses to see whether there
is a useful linear relationship between
the response variable and at least one of
the predictors.

c. The estimated coefficient of the indicator
variable for whether or not a
building was class-A was .364. Interpret
this estimated coefficient, first in
terms of y and then in terms of $/ft².

d. The t ratio for the estimated coefficient
of (c) was 5.49. What does this tell you?


89. Cerium dioxide (also called ceria) is used in
many applications, including pollution 
control and wastewater treatment. The 
article “Mechanical Properties of Gelcast 
Cerium Dioxide from 23 to 1500 °C” 
(J. Engr. Mater. Technol. 2017) reports an 
experiment to determine the relationship 
between y= elastic modulus (GPa) and 
x = temperature for ceria specimens under 
certain conditions. A scatterplot in the 
article suggests a quadratic relationship. 


a. The article reports the estimated equation
y=-1.92 x 10°x* — 0191x + 89.0. 
Over what temperature range does 
elastic modulus increase with temper- 
ature, and for what temperature range 
does it decrease? 

b. Predict the elastic modulus of a ceria 
specimen at 800 °C. 

c. The coefficient of determination is
reported as $R^2 = .948$. Use the fact that
the data consisted of n = 28 observations
to perform a model utility test at
the .01 level.

d. Information consistent with the article
suggests that at x = 800, $s_{\hat{Y}} = 2.9$ GPa.
Use this to calculate a 95% CI for
$\mu_{Y|800}$.

e. The residual standard deviation for the
quadratic model is roughly $s_e = 2.37$
GPa. Use this to calculate a 95% PI at
x = 800, and interpret this interval.


90. Many studies have researched how traffic
load affects road asphalt, but fewer have 
examined the effect of extreme cold weather. 
The article “Effects of Large Freeze-Thaw 
Cycles on Stiffness and Tensile Strength of 
Asphalt Concrete” (J. Cold Regions Engr. 
2016) reports the following data on 
y = indirect tensile strength (MPa) and 
x = temperature (°C) for six asphalt speci- 
mens in one particular experiment. 


x |  −35    −20    −10      0     10     22

y | 3.01   3.56   3.47   2.72   2.15   1.20


a. Verify that a scatterplot of the data is 
consistent with the choice of a quad- 
ratic regression model. 

b. Determine the estimated quadratic 
regression equation. 

c. Calculate a point prediction for the 
indirect tensile strength of this asphalt 
type at 0 °C. 
d. What proportion of the observed variation
in tensile strength can be attributed
to the quadratic regression relationship?

e. Obtain a 95% CI for $\mu_{Y|0}$, the true
expected tensile strength of this asphalt
type at 0 °C.

f. Obtain and interpret a 95% PI at x = 0.


91. Ethyl vanillin is used in food, cosmetics,
and pharmaceuticals for its vanilla-like 
scent. The article “Determination and Cor- 
relation of Ethyl Vanillin Solubility in 
Different Binary Solvents at Temperatures 
from 273.15 to 313.15 K” (J. Chem. Engr.
Data 2017: 1788-1796) reported an
experiment to determine y = ethyl vanillin
solubility (mole fraction) as a function of
$x_1$ = initial mole fraction of the chemical
propan-2-one in the solvent mixture and
$x_2$ = temperature (°K). The experiment was
run at seven $x_1$ and nine $x_2$ values. The
accompanying table shows the response, y,
at each combination. [Note: 273.15 °K
corresponds to 0 °C.]


$x_1$:    .4     .5     .6     .7     .8     .9    1.0

        10.6   13.4   16.8   18.5   19.5   19.8   19.9
        12.8   16.2   19.2   21.0   21.4   21.5   21.6
        13.7   18.8   20.9   23.1   23.4   23.5   23.7
        17.3   20.7   23.8   26.4   26.7   26.9   27.1
        20.5   25.2   26.9   29.1   29.6   29.8   30.0
        23.7   27.2   29.8   31.2   32.1   32.4   33.0
        27.4   31.6   33.4   35.0   36.2   36.2   36.9
        34.3   36.9   38.7   40.7   41.0   41.3   41.6
        38.5   42.1   43.4   45.3   45.5   45.7   45.9


a. Create scatterplots of y versus $x_1$ and
y versus $x_2$. Does it appear the predictors
are linearly related to y, or would
quadratic terms be appropriate?

b. Would a scatterplot of $x_1$ versus $x_2$
indicate whether an interaction term
might be suitable? Why or why not?

c. Perform a regression using the com- 
plete second-order model. Based on a 
residual analysis, does it appear that the 
model assumptions are satisfied? 

d. Test various hypotheses to determine 
which term(s) should be retained in the 
model. 


12 Regression and Correlation 


92. In the construction industry, a “project labor
agreement” (PLA) between clients and
contractors stipulates that all bidders on a 
project will use union labor and abide by 
union rules. The article “Do Project Labor 
Agreements Raise Construction Costs?” 
(CS-BIGS 2007: 71-79) investigated construction
costs for 126 schools in Massachusetts
over an eight-year period. Among
the variables considered were y = project
cost, in dollars per square foot; $x_1$ = project
size, in 1000s of ft²; $x_2$ = 1 for new construction
and 0 for remodel; and $x_3$ = 1 if a
PLA was in effect and 0 otherwise.


a. What would it mean in this context to
say that $x_1$ and $x_3$ have an interaction
effect?

b. What would it mean in this context to
say that $x_2$ and $x_3$ have an interaction
effect?

c. No second-order terms are statistically
significant here, and the estimated
regression equation for the first-order
model is $\hat{y} = 138.69 - .1236x_1 + 17.89x_2 + 18.83x_3$.
Interpret the coefficient on $x_1$.
Does the sign make sense?

d. Interpret the coefficient on $x_3$.

e. The estimated standard error of $\hat{\beta}_3$ is
4.96. Test the hypotheses $H_0$: $\beta_3 = 0$
vs. $H_a$: $\beta_3 > 0$ at the .01 significance
level. Does the data indicate that PLAs
tend to raise construction costs?


93. A regression analysis carried out to relate
y = repair time for a water filtration system
(hr) to $x_1$ = elapsed time since the previous
service (months) and $x_2$ = type of repair (1
if electrical and 0 if mechanical) yielded the
following model based on n = 12 observations:
$y = .950 + .400x_1 + 1.250x_2$. In
addition, SST = 12.72, SSE = 2.09, and
$s_{\hat{\beta}_2} = .312$.


a. Does there appear to be a useful linear 
relationship between repair time and 
the two model predictors? Carry out a 
test of the appropriate hypotheses using 
a significance level of .05. 
b. Given that elapsed time since the last
service remains in the model, does type
of repair provide useful information
about repair time? State and test the
appropriate hypotheses using a significance
level of .01.

c. Calculate and interpret a 95% CI for $\beta_2$.

d. The estimated standard deviation of a 
prediction for repair time when elapsed 
time is six months and the repair is 
electrical is .192. Predict repair time 
under these circumstances by calculat- 
ing a 99% prediction interval. Does the 
interval suggest that the estimated 
model will give an accurate prediction? 
Why or why not? 


94. Sometimes an investigator wishes to decide
whether a group of m predictors (m > 1)
can simultaneously be eliminated from the
model. The null hypothesis says that all $\beta$’s
associated with these m predictors are 0,
which is interpreted to mean that as long as
the other k − m predictors are retained in
the model, the m predictors under consideration
collectively provide no useful
information about y. The test is carried out
by first fitting the “full” model with all
k predictors to obtain SSE(full) and then
fitting the “reduced” model consisting just
of the k − m predictors not being considered
for deletion to obtain SSE(red). The
test statistic is


$$f = \frac{[\mathrm{SSE(red)} - \mathrm{SSE(full)}]/m}{\mathrm{SSE(full)}/[n - (k + 1)]}$$


The test is upper-tailed and based on
m numerator df and n − (k + 1) denominator
df. This procedure is called the partial
F test.

Refer back to Example 12.26. The follow- 
ing are the SSEs and numbers of predictors 
for the first-order, complete second-order, 
and the researchers’ final model. 
Model                    Predictors    SSE
First-order                   5       44,985
Complete second-order        20       16,092
Final                        12       17,794


a. Use the partial F test to compare the 
first-order and complete second-order 
models. Is there evidence that at least 
some of the second-order terms should 
be retained? 

b. Use the partial F test to compare the 
final model to the complete second- 
order model. Do you agree with the 
researchers’ decision to eliminate the 
“other” eight predictors? 


95. Utilization of sucrose as a carbon source for
the production of chemicals is uneconomical.
Beet molasses is a readily available and
low-priced substitute. The article “Optimization
of the Production of β-Carotene
from Molasses by Blakeslea trispora”
(J. Chem. Tech. Biotech. 2002: 933-943)
carried out a multiple regression analysis to
relate the response variable y = amount of
β-carotene (g/dm³) to the three predictors:
amount of linoleic acid, amount of kerosene,
and amount of antioxidant (all g/dm³).


Linoleic Kerosene Antiox. Betacarotene 
30.00 30.00 10.00 0.7000 
30.00 30.00 10.00 0.6300 
30.00 30.00 18.41 0.0130 
40.00 40.00 5.00 0.0490 
30.00 30.00 10.00 0.7000 
13.18 30.00 10.00 0.1000 
20.00 40.00 5.00 0.0400 
20.00 40.00 15.00 0.0065 
40.00 20.00 5.00 0.2020 
30.00 30.00 10.00 0.6300 
30.00 30.00 1.59 0.0400 
40.00 20.00 15.00 0.1320 
40.00 40.00 15.00 0.1500 
30.00 30.00 10.00 0.7000 
30.00 46.82 10.00 0.3460 
30.00 30.00 10.00 0.6300 
30.00 13.18 10.00 0.3970 
20.00 20.00 5.00 0.2690 
20.00 20.00 15.00 0.0054 
46.82 30.00 10.00 0.0640 
a. Fitting the complete second-order
model in the three predictors resulted
in $R^2 = .987$ and adjusted $R^2 = .974$,
whereas fitting the first-order model
gave $R^2 = .016$. What would you conclude
about the two models?

b. For $x_1 = x_2 = 30$, $x_3 = 10$, a statistical
software package reported that
$\hat{y} = .66573$, $s_{\hat{Y}} = .01785$, and $s_e = .044$
based on the complete second-order
model. Predict the amount of
β-carotene that would result from a
single experimental run with the designated
values of the explanatory
variables, and do so in a way that
conveys information about precision
and reliability.


96. Snowpacks contain a wide spectrum of 
pollutants that may represent environmental 
hazards. The article “Atmospheric PAH 
Deposition: Deposition Velocities and 
Washout Ratios” (J. Environ. Engr. 2002: 
186-195) focused on the deposition of 
polyaromatic hydrocarbons. The authors 
proposed a multiple regression model for 
relating deposition over a specified time 
period (y, in ug/m”) to two rather compli- 
cated predictors x, (ug-s/m?>) and x2 
(ug/m?) defined in terms of PAH air con- 
centrations for various species, total time, 
and total amount of precipitation. Here is 
data on the species fluoranthene and 
corresponding Minitab output: 


x1 x2 y
92017 .0026900 278.78
51830 .0030000 124.53
17236 .0000196 22.65
15776 .0000360 28.68 
33462 .0004960 32.66 

243500 .0038900 604.70 
67793 .0011200 27.69 
23471 .0006400 14.18 
13948 .0004850 20.64 

8824 .0003660 20.60 

7699 .0002290 16.61 
15791 .0014100 15.08 
10239 .0004100 18.05 
43835 .0000960 99.71 
49793 .0000896 58.97 
40656 .0026000 172.58 
50774 .0009530 44.25 
The regression equation is
filth dep = -33.5 + 0.00205 x1 + 29836 x2


Predictor   Coef        SE Coef     T       P
Constant    -33.46      14.90       -2.25   0.041
x1          0.0020548   0.0002945   6.98    0.000
x2          29836       13654       2.19    0.046

S = 44.28   R-Sq = 92.3%   R-Sq(adj) = 91.2%


Analysis of variance


Source           DF   SS        MS        F       P
Regression       2    330,989   165,495   84.39   0.000
Residual error   14   27,454    1961
Total            16   358,443


Formulate questions and perform appropriate 
analyses. Construct the appropriate residual 
plots, including plots against the predictors. 
Based on these plots, justify adding a quadratic 
predictor, and fit the model with this additional 
predictor. Does this predictor provide addi- 
tional useful information over and above what 
$x_1$ and $x_2$ contribute, and does it help the
appearance of the diagnostic plots? Also, the 
data includes a clear outlier. Re-run the 
regression without the outlier and determine 
whether quadratic terms are appropriate. 


97. The following data set has ratings from 
ratebeer.com along with values of IBU 
(international bittering units, a measure of 
bitterness) and ABV (alcohol by volume) 
for 25 beers. Notice which beers have the 
lowest ratings and which are highest. 


Beer                                  IBU   ABV    Rating
Amstel Light                          18    3.5    1.93
Anchor Liberty Ale                    54    5.9    3.60
Anchor Steam                          33    4.9    3.31
Bud Light                             7     4.2    1.15
Budweiser                             li    3      1.38
Coors                                 14    =°55   1.63
DAB Dark                              32    5      2.73
Dogfish 60 min IPA                    60    6      3.76
Great Divide Titan IPA                65    6.8    3.81
Great Divide Hercules Double IPA      85    9.1    4.05
Guinness Extra Stout                  60    5      3.38
Harp Lager                            21    4.3    2.85
Heineken                              23    5      2.13
Heineken Premium Light                1     632    1.62
Michelob Ultra                        4     4.2    1.01
Newcastle Brown Ale                   18    4.7    3.05
Pilsner Urquell                       35    4.4    3.28
Redhook ESB                           29    5.77   3.06
Rogue Imperial Stout                  88    11.6   3.98
Samuel Adams Boston Lager             31    449    3.19
Shiner Light                          13    4.03   2.57
Sierra Nevada Pale Ale                37    5.6    3.61
Sierra Nevada Porter                  40    5.6    3.60
Terrapin All-Amer. Imperial Pilsner   75    7.5    3.46
Three Floyds Alpha King               66    6      4.04


a. Find the correlations (and the corresponding P-values) among Rating,
IBU, and ABV.

b. Regress rating on IBU and ABV. Notice that although both predictors
have strongly significant correlations with Rating, they do not both have
significant regression coefficients. How do you explain this?

c. Plot the residuals from the regression of (b) to check the assumptions. Also plot
rating against each of the two predictors. Which of the assumptions is
clearly not satisfied?

d. Regress rating on IBU and ABV with the square of IBU as a third predictor.
Again check assumptions.

e. How effective is the regression in (d)? Interpret the coefficients with regard to
statistical significance and sign. In particular, discuss the relationship to IBU.

f. Summarize your conclusions.


98. The article “Promoting Healthy Choices: Information versus Convenience” (Amer.
Econ. J.: Appl. Econ. 2010: 164-178) reported on a field experiment at a fast-food
sandwich chain to see whether calorie information provided to patrons would
affect calorie intake. One aspect of the study involved fitting a multiple regression
model with seven predictors to data consisting of 342 observations. Predictors in
the model included age and indicator variables for sex, whether or not a daily calorie
recommendation was provided, and whether or not calorie information about choices
was provided. The reported value of the F ratio for testing model utility was 3.64.

a. At significance level .01, does the 
model appear to specify a useful linear 
relationship between calorie intake and 
at least one of the predictors? 

b. What can be said about the P-value for 
the model utility F test? 

c. What proportion of the observed variation in calorie intake can be attributed
to the model relationship? Does this seem very impressive? Why is the P-value
as small as it is?

d. The estimated coefficient for the indicator variable calorie information provided
was -71.73, with an estimated standard error of 25.29. Interpret the coefficient.
After adjusting for the effects of other predictors, does it appear that true average
calorie intake depends on whether or not calorie information is provided? Carry
out a test of appropriate hypotheses.

Throughout this chapter we have explored linear models with both one and several predictors. It
should perhaps not be surprising that such models can be embedded in the language of linear algebra,
i.e., in matrix form. In this section, we rewrite the model equation and least squares estimates in terms
of certain matrices and then derive matrix-based formulas for several of the quantities mentioned in
earlier sections. (The focus here will be on multiple regression, since simple linear regression is just
the special case where k = 1.) In fact, all software packages that perform regression analysis rely on
these matrix representations for computation.


The Model Equation in Matrix Form 
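In Section 12.7 we used the following additive model equation to relate a response variable y to
explanatory variables $x_1, \ldots, x_k$:

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \]

where $\varepsilon \sim N(0, \sigma)$ and the $\varepsilon_i$'s for different observations are independent of one another. Suppose
there are n observations, each consisting of a y value and values of the k predictors (so each
observation consists of k + 1 numbers). Then the n equations for the various observations can be
expressed compactly using matrix notation:

\[
\begin{array}{c}
Y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1 \\
\vdots \\
Y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n
\end{array}
\quad\Longleftrightarrow\quad
\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\tag{12.15}
\]

The dimensions of the four matrices in (12.15), from left to right, are n × 1, n × (k + 1),
(k + 1) × 1, and n × 1. If we denote these four matrices by $\mathbf{Y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, and $\boldsymbol{\varepsilon}$, then the multiple linear
regression model is equivalent to

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]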


We will use $\mathbf{y}$ to denote the n × 1 column vector of observed y values: $\mathbf{y} = [y_1, \ldots, y_n]'$, where ' denotes
matrix transpose. The vector $\mathbf{y}$ (or $\mathbf{Y}$) is called the response vector, while the matrix $\mathbf{X}$ is known as the
design matrix. The design matrix consists of one row for each observation (n rows total) and one column
for each predictor, along with a leading column of 1's to accommodate the constant term.


Parameter Estimation in Matrix Form 
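We now estimate $\beta_0, \beta_1, \ldots, \beta_k$ using the principle of least squares. Let $\|\mathbf{u}\|$ denote the (Euclidean)
length of a column vector $\mathbf{u}$, i.e., $\|\mathbf{u}\|^2 = \sum u_i^2 = \mathbf{u}'\mathbf{u}$. Then our goal is to find $b_0, b_1, \ldots, b_k$ to minimize

\[ g(b_0, b_1, \ldots, b_k) = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})]^2 = \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2, \]

where $\mathbf{b}$ is the column vector with entries $b_0, b_1, \ldots, b_k$. One solution method was outlined in
Section 12.7: if we set the partial derivatives of g with respect to $b_0, b_1, \ldots, b_k$ equal to zero, the result
is the normal equations in Expression (12.13). In matrix form, (12.13) becomes

\[
\begin{bmatrix}
n & \sum_{i=1}^{n} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik} \\
\sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}x_{i1} & \cdots & \sum_{i=1}^{n} x_{i1}x_{ik} \\
\vdots & \vdots & & \vdots \\
\sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik}x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik}x_{ik}
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{i1}y_i \\ \vdots \\ \sum_{i=1}^{n} x_{ik}y_i \end{bmatrix}
\]

The matrix on the left is $\mathbf{X}'\mathbf{X}$ and the one on the far right is $\mathbf{X}'\mathbf{y}$. The normal equations then become
$\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{y}$. We will assume throughout this section that $\mathbf{X}'\mathbf{X}$ has an inverse, so the vector of
estimated coefficients is given by

\[ \hat{\boldsymbol{\beta}} = \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \tag{12.16} \]

Statistical software packages rely on a matrix formulation equivalent to the normal equations to calculate
the estimated regression coefficients. (At the end of this section, we present an alternative derivation of $\hat{\boldsymbol{\beta}}$
that relies on linear algebra rather than partial derivatives from calculus.)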


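Example 12.28 For illustrative purposes, suppose we wish to predict horsepower using engine size
(liters) and fuel type (premium or regular) based on a sample of n = 6 cars.

Horsepower   Engine size   Fuel type
132          2.0           Regular
167          2.0           Premium
170          2.5           Regular
204          2.5           Premium
230          3.0           Regular
260          3.0           Premium

Define variables y = horsepower, $x_1$ = engine size, and $x_2$ = 1 for premium fuel and 0 for regular
(an indicator variable). Then the response vector and design matrix here are

\[
\mathbf{y} = \begin{bmatrix} 132 \\ 167 \\ 170 \\ 204 \\ 230 \\ 260 \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & 2.0 & 0 \\ 1 & 2.0 & 1 \\ 1 & 2.5 & 0 \\ 1 & 2.5 & 1 \\ 1 & 3.0 & 0 \\ 1 & 3.0 & 1 \end{bmatrix}
\;\Rightarrow\;
\mathbf{X}'\mathbf{X} = \begin{bmatrix} 6 & 15 & 3 \\ 15 & 38.5 & 7.5 \\ 3 & 7.5 & 3 \end{bmatrix}
\text{ and }
\mathbf{X}'\mathbf{y} = \begin{bmatrix} 1163 \\ 3003 \\ 631 \end{bmatrix}
\]

Notice that $\mathbf{X}'\mathbf{X}$ is symmetric (and will be in all cases; do you see why?). The least squares estimates
of the regression coefficients are

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
= \begin{bmatrix} 79/12 & -5/2 & -1/3 \\ -5/2 & 1 & 0 \\ -1/3 & 0 & 2/3 \end{bmatrix}
\begin{bmatrix} 1163 \\ 3003 \\ 631 \end{bmatrix}
= \begin{bmatrix} -61.417 \\ 95.5 \\ 33 \end{bmatrix}
\]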
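The arithmetic in (12.16) can be checked by hand or with a few lines of code. The following is a minimal Python sketch using exact rational arithmetic; it is only an illustration, not how statistical packages carry out the computation, and the helper names `transpose`, `matmul`, and `inverse` are assumptions introduced here.

```python
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    # Each entry is a row of A times a column of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inversion in exact arithmetic; assumes A is square and invertible."""
    n = len(A)
    # Augment A with the identity matrix: [A | I].
    M = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

# Data from Example 12.28: horsepower, engine size (liters), premium-fuel indicator.
hp = [132, 167, 170, 204, 230, 260]
size = ["2.0", "2.0", "2.5", "2.5", "3.0", "3.0"]
premium = [0, 1, 0, 1, 0, 1]

X = [[F(1), F(s), F(p)] for s, p in zip(size, premium)]  # design matrix
y = [[F(v)] for v in hp]                                  # response vector (n x 1)

XtX = matmul(transpose(X), X)
Xty = matmul(transpose(X), y)
beta_hat = matmul(inverse(XtX), Xty)                      # Equation (12.16)

print([float(b[0]) for b in beta_hat])
```

Working with fractions avoids rounding error, so the three entries of `beta_hat` can be compared directly with -61.417, 95.5, and 33.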


Figure 12.34 shows R output from multiple regression using this (toy) data set. Notice that the estimated
regression coefficients exactly match our vector $\hat{\boldsymbol{\beta}}$.

Residuals:
      1      2      3      4      5      6
  2.417  4.417 -7.333 -6.333  4.917  1.917

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -61.417     17.966  -3.419 0.041887 *
x1            95.500      7.002  13.639 0.000853 ***
x2            33.000      5.717   5.772 0.010337 *
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.002 on 3 degrees of freedom
Multiple R-squared: 0.9865, Adjusted R-squared: 0.9775
F-statistic: 109.7 on 2 and 3 DF, p-value: 0.001567

Figure 12.34 R output for Example 12.28

Residuals, ANOVA, F, and R²

The estimated regression coefficients can be used to obtain the predicted values and the residuals.
Recall that the ith predicted value is $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$. The vector of predicted
values, $\hat{\mathbf{y}}$, is


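\[
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{bmatrix}
= \begin{bmatrix} \hat\beta_0 + \hat\beta_1 x_{11} + \cdots + \hat\beta_k x_{1k} \\ \vdots \\ \hat\beta_0 + \hat\beta_1 x_{n1} + \cdots + \hat\beta_k x_{nk} \end{bmatrix}
= \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\]

If we define a matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, this relationship can be rewritten as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. The matrix $\mathbf{H}$ is
amusingly called the hat matrix because it “puts a hat” on the vector $\mathbf{y}$.
The residual for the ith observation is defined by $e_i = y_i - \hat{y}_i$, and so the residual vector is

\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}, \]

where $\mathbf{I}$ denotes the n × n identity matrix. Now the sums of squares encountered throughout this
chapter can also be written in matrix form (more precisely, as squared lengths of particular vectors).
Let $\bar{\mathbf{y}}$ denote an n × 1 column vector whose every entry is the sample mean y value, $\bar{y}$. Then

\[
\begin{aligned}
\text{SSE} &= \sum e_i^2 = \|\mathbf{e}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 \\
\text{SSR} &= \sum (\hat{y}_i - \bar{y})^2 = \|\hat{\mathbf{y}} - \bar{\mathbf{y}}\|^2 \\
\text{SST} &= \sum (y_i - \bar{y})^2 = \|\mathbf{y} - \bar{\mathbf{y}}\|^2
\end{aligned}
\]

from which, as before, $\hat\sigma^2 = \text{MSE} = \text{SSE}/[n - (k + 1)]$ and $R^2 = \text{SSR}/\text{SST}$. The fundamental
ANOVA identity SST = SSR + SSE can be obtained as follows:

\[
\begin{aligned}
\text{SST} = \|\mathbf{y} - \bar{\mathbf{y}}\|^2 = (\mathbf{y} - \bar{\mathbf{y}})'(\mathbf{y} - \bar{\mathbf{y}})
&= [(\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \bar{\mathbf{y}})]'[(\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \bar{\mathbf{y}})] \\
&= \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \bar{\mathbf{y}}\|^2 = \text{SSE} + \text{SSR}
\end{aligned}
\]

The cross-terms in the matrix product are zero because of the normal equations (see Exercise 104).
Equivalently, the middle two terms drop out because the vectors $\hat{\mathbf{y}} - \bar{\mathbf{y}}$ and $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ are orthogonal.
The model utility test of $H_0\colon \beta_1 = \cdots = \beta_k = 0$ uses the same F ratio as before:

\[
F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/k}{\text{SSE}/[n - (k+1)]} = \frac{\|\hat{\mathbf{y}} - \bar{\mathbf{y}}\|^2/k}{\|\mathbf{e}\|^2/[n - (k+1)]}
\]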


Example 12.29 (Example 12.28 continued) The predicted values and residuals are easily obtained:

\[
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}
= \begin{bmatrix} 1 & 2.0 & 0 \\ 1 & 2.0 & 1 \\ 1 & 2.5 & 0 \\ 1 & 2.5 & 1 \\ 1 & 3.0 & 0 \\ 1 & 3.0 & 1 \end{bmatrix}
\begin{bmatrix} -61.417 \\ 95.5 \\ 33 \end{bmatrix}
= \begin{bmatrix} 129.583 \\ 162.583 \\ 177.333 \\ 210.333 \\ 225.083 \\ 258.083 \end{bmatrix},
\qquad
\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}
= \begin{bmatrix} 132 \\ 167 \\ 170 \\ 204 \\ 230 \\ 260 \end{bmatrix}
- \begin{bmatrix} 129.583 \\ 162.583 \\ 177.333 \\ 210.333 \\ 225.083 \\ 258.083 \end{bmatrix}
= \begin{bmatrix} 2.417 \\ 4.417 \\ -7.333 \\ -6.333 \\ 4.917 \\ 1.917 \end{bmatrix}
\]

From these, $\text{SSE} = \|\mathbf{e}\|^2 = 2.417^2 + \cdots + 1.917^2 = 147.083$, $\text{MSE} = \text{SSE}/[n - (k+1)] =
147.083/[6 - (2 + 1)] = 49.028$, and $s_e = \sqrt{49.028} = 7.002$. The total sum of squares is
$\text{SST} = \|\mathbf{y} - \bar{\mathbf{y}}\|^2 = \sum (y_i - 193.83)^2 = 10{,}900.83$, and the regression sum of squares can be obtained most
easily by subtraction: SSR = SST − SSE = 10,900.83 − 147.083 = 10,753.75. The coefficient of
multiple determination is $R^2 = \text{SSR}/\text{SST} = .9865$. Finally, the model utility test statistic value is
f = MSR/MSE = (10,753.75/2)/49.028 = 109.67, a massive F ratio that decisively rejects $H_0$ at any
reasonable significance level. Notice that many of these quantities appear in the lower half of the R
output in Figure 12.34.
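The quantities in this example follow mechanically from the fitted coefficients. Here is a short Python sketch of that chain of calculations; the coefficient vector is entered as exact fractions (-737/12, 191/2, 33), which is an assumption that these are the unrounded values behind -61.417, 95.5, and 33, and all variable names are illustrative.

```python
from fractions import Fraction as F

# Toy data from Example 12.28.
sizes = [F(2), F(2), F(5, 2), F(5, 2), F(3), F(3)]
fuel = [0, 1, 0, 1, 0, 1]
y = [132, 167, 170, 204, 230, 260]
X = [[F(1), s, F(f)] for s, f in zip(sizes, fuel)]

# Assumed exact form of the estimates (-61.417, 95.5, 33).
beta_hat = [F(-737, 12), F(191, 2), F(33)]

n, k = len(y), len(X[0]) - 1
y_hat = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]  # fitted values
e = [yi - fi for yi, fi in zip(y, y_hat)]                          # residuals
y_bar = F(sum(y), n)

sse = sum(r * r for r in e)                   # ||e||^2
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)  # ||y_hat - y_bar||^2
sst = sum((yi - y_bar) ** 2 for yi in y)      # ||y - y_bar||^2

mse = sse / (n - (k + 1))
r_squared = ssr / sst
f_ratio = (ssr / k) / mse

print(float(sse), float(mse), float(r_squared), float(f_ratio))
```

Because the arithmetic is exact, the ANOVA identity SST = SSE + SSR holds with equality here rather than only up to rounding.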


Inference About Individual Parameters 
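In order to develop hypothesis tests and confidence intervals for the regression coefficients, the
expected values and standard deviations of the estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are needed.

DEFINITION Let $U_1, \ldots, U_m$ be rvs and let $\mathbf{U}$ denote the m × 1 column vector $[U_1, \ldots, U_m]'$. Then
the mean vector of $\mathbf{U}$ is the m × 1 column vector $\boldsymbol{\mu} = E(\mathbf{U})$ whose ith entry is
$\mu_i = E(U_i)$. The covariance matrix of $\mathbf{U}$ is the m × m matrix whose (i, j)th entry
is the covariance of $U_i$ and $U_j$. That is,

\[
\text{Cov}(\mathbf{U}) = \begin{bmatrix}
\text{Cov}(U_1, U_1) & \text{Cov}(U_1, U_2) & \cdots & \text{Cov}(U_1, U_m) \\
\text{Cov}(U_2, U_1) & \text{Cov}(U_2, U_2) & \cdots & \text{Cov}(U_2, U_m) \\
\vdots & \vdots & & \vdots \\
\text{Cov}(U_m, U_1) & \text{Cov}(U_m, U_2) & \cdots & \text{Cov}(U_m, U_m)
\end{bmatrix}
\]

If we define the expected value of a matrix of rvs by the element-wise expectations of
its entries, then it follows from the definition $\text{Cov}(U_i, U_j) = E[(U_i - \mu_i)(U_j - \mu_j)]$
that

\[ \text{Cov}(\mathbf{U}) = E[(\mathbf{U} - \boldsymbol{\mu})(\mathbf{U} - \boldsymbol{\mu})'] \tag{12.17} \]

The diagonal entries of the covariance matrix are the variances of the rvs: $\text{Cov}(U_i, U_i) = V(U_i)$. Also,
the covariance matrix is symmetric, since $\text{Cov}(U_i, U_j) = \text{Cov}(U_j, U_i)$.

For example, suppose $U_1$ and $U_2$ are rvs with means 10 and −4, standard deviations 2.5 and 2.3,
and covariance −1.1. Then the mean vector and covariance matrix of $\mathbf{U} = [U_1, U_2]'$ are

\[
E(\mathbf{U}) = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} 10 \\ -4 \end{bmatrix}
\quad\text{and}\quad
\text{Cov}(\mathbf{U}) = \begin{bmatrix} V(U_1) & \text{Cov}(U_1, U_2) \\ \text{Cov}(U_2, U_1) & V(U_2) \end{bmatrix}
= \begin{bmatrix} 2.5^2 & -1.1 \\ -1.1 & 2.3^2 \end{bmatrix}
\]

Now consider the vector of random errors $\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_n]'$. The linear regression model assumes that
the $\varepsilon_i$'s are independent (so covariance = 0 for each pair) with mean 0 and common variance $\sigma^2$.
Under these assumptions, the mean vector of $\boldsymbol{\varepsilon}$ is $\mathbf{0}$ (an n × 1 vector of 0's), while the covariance
matrix of $\boldsymbol{\varepsilon}$ is $\sigma^2\mathbf{I}$ (an n × n matrix with $\sigma^2$ along the main diagonal and 0's everywhere else). It then
follows from the model equation $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ that the (random) response vector $\mathbf{Y}$ satisfies

\[ E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta} + \mathbf{0} = \mathbf{X}\boldsymbol{\beta} \quad\text{and}\quad \text{Cov}(\mathbf{Y}) = \text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I} \]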


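To determine the sampling distribution of $\hat{\boldsymbol{\beta}}$, we will require the following proposition.

PROPOSITION Let $\mathbf{U}$ be a random vector. If $\mathbf{A}$ is a matrix with constant entries and $\mathbf{V} = \mathbf{A}\mathbf{U}$,
then $E(\mathbf{V}) = \mathbf{A}E(\mathbf{U})$ and $\text{Cov}(\mathbf{V}) = \mathbf{A}\,\text{Cov}(\mathbf{U})\,\mathbf{A}'$.

Proof By the linearity of the expectation operator, $E(\mathbf{V}) = E(\mathbf{A}\mathbf{U}) = \mathbf{A}E(\mathbf{U}) = \mathbf{A}\boldsymbol{\mu}$. Then, using
(12.17),

\[
\begin{aligned}
\text{Cov}(\mathbf{V}) &= E[(\mathbf{V} - E(\mathbf{V}))(\mathbf{V} - E(\mathbf{V}))'] && \text{Equation (12.17)} \\
&= E[(\mathbf{A}\mathbf{U} - \mathbf{A}\boldsymbol{\mu})(\mathbf{A}\mathbf{U} - \mathbf{A}\boldsymbol{\mu})'] = E[\mathbf{A}(\mathbf{U} - \boldsymbol{\mu})(\mathbf{A}(\mathbf{U} - \boldsymbol{\mu}))'] \\
&= E[\mathbf{A}(\mathbf{U} - \boldsymbol{\mu})(\mathbf{U} - \boldsymbol{\mu})'\mathbf{A}'] \\
&= \mathbf{A}E[(\mathbf{U} - \boldsymbol{\mu})(\mathbf{U} - \boldsymbol{\mu})']\mathbf{A}' && \text{linearity of expectation} \\
&= \mathbf{A}\,\text{Cov}(\mathbf{U})\,\mathbf{A}'
\end{aligned}
\]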
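A small numerical case may make the proposition concrete. The Python sketch below reuses the rvs $U_1$ and $U_2$ from the earlier illustration (means 10 and −4, standard deviations 2.5 and 2.3, covariance −1.1); the matrix A, which forms the sum and the difference of the two rvs, is a choice made here purely for illustration.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    # Each entry is a row of A times a column of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

mean_U = [[10.0], [-4.0]]
cov_U = [[2.5 ** 2, -1.1],
         [-1.1, 2.3 ** 2]]

# Illustrative choice of A: V1 = U1 + U2 and V2 = U1 - U2.
A = [[1, 1],
     [1, -1]]

mean_V = matmul(A, mean_U)                      # E(V) = A E(U)
cov_V = matmul(matmul(A, cov_U), transpose(A))  # Cov(V) = A Cov(U) A'

print(mean_V)
print(cov_V)
```

The diagonal of `cov_V` reproduces the familiar rules V(U1 + U2) = V(U1) + V(U2) + 2Cov(U1, U2) and V(U1 − U2) = V(U1) + V(U2) − 2Cov(U1, U2), and the off-diagonal entry equals V(U1) − V(U2).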


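Let's apply this proposition to find the mean vector and covariance matrix of $\hat{\boldsymbol{\beta}}$. As an estimator,
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, so let $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and $\mathbf{U} = \mathbf{Y}$. By linearity of expectation (the first part of the
proposition),

\[ E(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{Y}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \]

That is, $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$ (for each j, $\hat\beta_j$ is unbiased for estimating $\beta_j$).
Next, the transpose of $\mathbf{A}$ is $\mathbf{A}' = [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$; this relies on the fact that $\mathbf{X}'\mathbf{X}$ is
symmetric, so $(\mathbf{X}'\mathbf{X})^{-1}$ is symmetric as well. Applying the second part of the proposition and the
earlier observation that $\text{Cov}(\mathbf{Y}) = \sigma^2\mathbf{I}$,

\[
\text{Cov}(\hat{\boldsymbol{\beta}}) = \mathbf{A}\,\text{Cov}(\mathbf{Y})\,\mathbf{A}' = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\sigma^2\mathbf{I}]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}
\]

So, the variance of the regression coefficient $\hat\beta_j$ is the jth diagonal entry of the matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$
(counting the diagonal entries from j = 0, so that the first entry belongs to $\hat\beta_0$).
Exercise 101 asks you to demonstrate that this matrix formula matches the variance formulas presented
earlier in the chapter for simple linear regression. Since $\sigma$ is unknown, it must be estimated
from the data, and the estimated covariance matrix of $\hat{\boldsymbol{\beta}}$ is $s_e^2(\mathbf{X}'\mathbf{X})^{-1}$.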


Example 12.30 (Example 12.29 continued) For the engine horsepower scenario we previously
found the matrix $(\mathbf{X}'\mathbf{X})^{-1}$ and the residual standard deviation $s_e = 7.002$. The estimated covariance
matrix of $\hat{\boldsymbol{\beta}}$ is

\[
7.002^2 \begin{bmatrix} 79/12 & -5/2 & -1/3 \\ -5/2 & 1 & 0 \\ -1/3 & 0 & 2/3 \end{bmatrix}
= \begin{bmatrix} 322.766 & -122.569 & -16.343 \\ -122.569 & 49.028 & 0.0 \\ -16.343 & 0.0 & 32.685 \end{bmatrix}
\]

The estimated standard deviations of the three coefficients are $s_{\hat\beta_0} = \sqrt{322.766} = 17.966$,
$s_{\hat\beta_1} = \sqrt{49.028} = 7.002$, and $s_{\hat\beta_2} = \sqrt{32.685} = 5.717$. Notice that these exactly match the standard
errors given in the R output of Figure 12.34. These estimated standard errors form the basis for the
variable utility tests and CIs for the $\beta_j$'s presented in Section 12.7.

The covariance matrix also indicates that $\text{Cov}(\hat\beta_0, \hat\beta_1) \approx -122.569$, meaning the estimates for the
y-intercept and the slope on engine size are negatively correlated. In other words, if for a particular
sample the estimated slope $\hat\beta_1$ is greater than its expectation (the true coefficient $\beta_1$), then typically
the value of $\hat\beta_0$ will be less than $\beta_0$. This makes sense for (x, y) values in the first quadrant: rotating
from the true line $y = \beta_0 + \beta_1 x$, if sample data results in a slope estimate that is too high (so the line is
overly steep), the y-intercept estimate will naturally be too low.
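The standard errors and t ratios in this example can be recomputed from $(\mathbf{X}'\mathbf{X})^{-1}$ and MSE alone. The following Python sketch does so; it takes the inverse matrix from Example 12.28 and enters MSE as the exact fraction 1765/36, an assumption equivalent to 147.083/3 before rounding. Variable names are illustrative.

```python
import math
from fractions import Fraction as F

# (X'X)^{-1} as given in Example 12.28.
XtX_inv = [[F(79, 12), F(-5, 2), F(-1, 3)],
           [F(-5, 2),  F(1),     F(0)],
           [F(-1, 3),  F(0),     F(2, 3)]]

mse = F(1765, 36)               # assumed exact value of SSE/[n - (k + 1)], about 49.028
beta_hat = [-61.417, 95.5, 33]  # estimates from Example 12.28

# Estimated covariance matrix of beta_hat: s_e^2 (X'X)^{-1}.
cov_beta = [[mse * v for v in row] for row in XtX_inv]

# Standard errors are square roots of the diagonal entries.
std_errors = [math.sqrt(cov_beta[j][j]) for j in range(3)]

# t ratios for testing H0: beta_j = 0.
t_ratios = [b / s for b, s in zip(beta_hat, std_errors)]

print([round(s, 3) for s in std_errors])
print([round(t, 3) for t in t_ratios])
```

The t ratios obtained this way correspond to the "t value" column of the R output in Figure 12.34.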


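What about the standard error of $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1^* + \cdots + \hat\beta_k x_k^*$, our point estimate of the mean
response $\mu_{Y \mid \mathbf{x}^*}$ at a specified set of x values? The point estimate may be written as $\hat{Y} = \mathbf{x}^*\hat{\boldsymbol{\beta}}$, where
$\mathbf{x}^*$ is the row vector $\mathbf{x}^* = [1, x_1^*, \ldots, x_k^*]$. Here, $\mathbf{x}^*$ is constant but $\hat{\boldsymbol{\beta}}$ (a vector of estimators) has
sampling variability. Applying the earlier proposition,

\[
V(\hat{Y}) = V(\mathbf{x}^*\hat{\boldsymbol{\beta}}) = \mathbf{x}^*\,\text{Cov}(\hat{\boldsymbol{\beta}})\,[\mathbf{x}^*]' = \mathbf{x}^*\sigma^2(\mathbf{X}'\mathbf{X})^{-1}[\mathbf{x}^*]' = \sigma^2\,\mathbf{x}^*(\mathbf{X}'\mathbf{X})^{-1}[\mathbf{x}^*]'
\;\Rightarrow\;
s_{\hat{Y}}^2 = s_e^2 \cdot \mathbf{x}^*(\mathbf{X}'\mathbf{X})^{-1}[\mathbf{x}^*]'
\]

(It's easy to verify here that the expression for $s_{\hat{Y}}^2$ is a 1 × 1 matrix and, as such, may be treated as a
scalar.) The square root of this expression gives the estimated standard error of $\hat{Y}$, which is required
for confidence and prediction intervals.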
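As an illustration of this formula, the Python sketch below evaluates the point estimate and its estimated standard error for the toy horsepower data at one set of predictor values. The choice $\mathbf{x}^* = [1, 2.5, 1]$ (a 2.5-liter engine using premium fuel) is an assumption made here for the example, as are the variable names and the exact-fraction forms of $\hat{\boldsymbol{\beta}}$ and MSE.

```python
import math
from fractions import Fraction as F

# (X'X)^{-1} from Example 12.28 and exact forms of beta_hat and MSE (assumed unrounded values).
XtX_inv = [[F(79, 12), F(-5, 2), F(-1, 3)],
           [F(-5, 2),  F(1),     F(0)],
           [F(-1, 3),  F(0),     F(2, 3)]]
beta_hat = [F(-737, 12), F(191, 2), F(33)]
mse = F(1765, 36)

# Hypothetical point of interest: 2.5-liter engine, premium fuel.
x_star = [F(1), F(5, 2), F(1)]

# Point estimate of the mean response: x* beta_hat.
y_hat_star = sum(x * b for x, b in zip(x_star, beta_hat))

# Quadratic form x* (X'X)^{-1} [x*]'.
quad = sum(x_star[r] * XtX_inv[r][c] * x_star[c]
           for r in range(3) for c in range(3))

se_fit = math.sqrt(mse * quad)  # estimated standard error of Y-hat

print(float(y_hat_star), float(quad), se_fit)
```

Since this $\mathbf{x}^*$ coincides with the fourth row of the design matrix, the quadratic form is also the fourth diagonal entry of $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.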


The Hat Matrix, Leverage, and Outlier Detection 
The foregoing proposition can also be used to find estimated standard deviations for the residuals.
Recall that the n × n hat matrix is defined by $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. With the help of the matrix rules
$(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$ and $(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}$, we find that $\mathbf{H}$ is symmetric, i.e., $\mathbf{H}' = \mathbf{H}$:

\[
\mathbf{H}' = [\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' = (\mathbf{X}')'[(\mathbf{X}'\mathbf{X})^{-1}]'\mathbf{X}' = \mathbf{X}[(\mathbf{X}'\mathbf{X})']^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H}
\]

Next, recall that the vector of predicted values is given by $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$; here, we're treating the response
vector $\mathbf{Y}$ as random, which implies that $\hat{\mathbf{Y}}$ is also a random vector. Thus

\[
\text{Cov}(\hat{\mathbf{Y}}) = \mathbf{H}\,\text{Cov}(\mathbf{Y})\,\mathbf{H}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\sigma^2\mathbf{I}]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'
= \sigma^2\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \sigma^2\mathbf{H}
\tag{12.18}
\]

A similar calculation shows that the covariance matrix of the residuals is

\[ \text{Cov}(\mathbf{Y} - \hat{\mathbf{Y}}) = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{12.19} \]

The variances of $\hat{Y}_i$ and $e_i$ are the diagonal entries of the matrices in (12.18) and (12.19), respectively.
Of course, the value of $\sigma^2$ is generally unknown, so the estimate $\hat\sigma^2 = \text{MSE}$ is used instead. If we let
$h_{ii}$ denote the ith diagonal entry of $\mathbf{H}$, then (12.18) and (12.19) imply that the (estimated) standard
deviations of $\hat{Y}_i$ and $e_i$ are

\[ s_{\hat{Y}_i} = s_e\sqrt{h_{ii}} \quad\text{and}\quad s_{e_i} = s_e\sqrt{1 - h_{ii}} \]

For the case of simple linear regression, it can be shown that these expressions match the standard
error formulas given previously (Exercise 110).

The hat matrix is also important as a measure of the influence of individual observations. Because $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$,
$\hat{y}_i = h_{i1}y_1 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n$, and therefore the ith diagonal element of $\mathbf{H}$ measures the impact
of the ith observation $y_i$ on its own predicted value $\hat{y}_i$. The $h_{ii}$'s are sometimes called the leverages to
indicate their impact on the regression. An observation with very high leverage will tend to pull the
regression toward it, and its residual will tend to be small. Notice, though, that $\mathbf{H}$ depends only on the
values of the predictors (through the design matrix $\mathbf{X}$), so leverage measures only one aspect of influence.


Example 12.31 Students in a statistics class measured their height, foot length, and wingspan 
(measured fingertip to fingertip with hands outstretched) in inches. The accompanying table shows 
the measurements for 16 students; we encountered this data previously in Example 12.16. The last 
column has the leverages for the regression of wingspan on height and foot length. 


Student   Height (x1)   Foot (x2)   Wingspan (y)   Leverage
1         63.0          9.0         62.0           0.239860
2         63.0          9.0         62.0           0.239860
3         65.0          9.0         64.0           0.228236
4         64.0          9.5         64.5           0.223625
5         68.0          9.5         67.0           0.196418
6         69.0          10.0        69.0           0.083676
7         71.0          10.0        70.0           0.262182
8         68.0          10.0        72.0           0.067207
9         68.0          10.5        70.0           0.187088
10        72.0          10.5        72.0           0.151959
11        73.0          11.0        73.0           0.143279
12        73.5          11.0        75.0           0.168719
13        70.0          11.0        71.0           0.245380
14        70.0          11.0        70.0           0.245380
15        72.0          11.0        76.0           0.128790
16        74.0          11.2        76.5           0.188340


Figure 12.35 shows a plot of $x_1$ = height against $x_2$ = foot length, along with the leverage for each
point. Notice that the points at the extreme right and left of the plot have high leverage, and the points
near the center have low leverage. However, it is interesting that the point with highest leverage is not
at the extremes of height or foot length. This is student number 7, with a 10-in. foot and height of 71
in., and the high leverage comes from the height being extreme relative to foot length. Indeed, when
there are several predictors, high leverage often occurs when values of one predictor are extreme
relative to the values of other predictors. For example, if height and weight are predictors, then an
overweight or underweight subject would likely have high leverage.

Figure 12.35 Plot of height and foot length showing leverage

Together, standardized residuals and leverages can be used to identify unusual observations in a
regression setting. This is particularly helpful in multiple regression, where outliers are more difficult
to detect with the naked eye (e.g., student 7 in Example 12.31). By convention, the ith observation is
said to have a large residual if $|e_i^*| > 2$ and a very large residual if $|e_i^*| > 3$, since these indicate that
$y_i$ is more than two (resp., three) standard deviations away from the value predicted by the estimated
regression function.

As for the leverage values, it can be shown that

\[ 0 \le h_{ii} \le 1 \quad\text{and}\quad \sum_{i=1}^{n} h_{ii} = k + 1 \]

where k is the number of predictor variables in the regression model. (In fact, it isn't difficult to show
directly that $\sum h_{ii} = 2$ for simple regression, i.e., k = 1.) This implies that the mean of the leverage
values is (k + 1)/n, since there are n leverage values total (one for each observation). By convention,
the ith observation is said to possess high leverage if $h_{ii} > 2(k + 1)/n$ and very high leverage if
$h_{ii} > 3(k + 1)/n$.
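These conventions are easy to apply in code. The Python sketch below computes the leverages and standardized residuals for the six-car data of Example 12.28 and applies the cutoffs just described; it is a sketch under the assumption that $(\mathbf{X}'\mathbf{X})^{-1}$ and $\hat{\boldsymbol{\beta}}$ are entered as the exact fractions behind the rounded values in that example, and the variable names are illustrative.

```python
import math
from fractions import Fraction as F

sizes = [F(2), F(2), F(5, 2), F(5, 2), F(3), F(3)]
fuel = [0, 1, 0, 1, 0, 1]
y = [132, 167, 170, 204, 230, 260]
X = [[F(1), s, F(f)] for s, f in zip(sizes, fuel)]

XtX_inv = [[F(79, 12), F(-5, 2), F(-1, 3)],
           [F(-5, 2),  F(1),     F(0)],
           [F(-1, 3),  F(0),     F(2, 3)]]
beta_hat = [F(-737, 12), F(191, 2), F(33)]

n, k = len(y), len(X[0]) - 1

# Leverage h_ii = x_i (X'X)^{-1} x_i', where x_i is the ith row of X.
leverages = [sum(row[r] * XtX_inv[r][c] * row[c]
                 for r in range(k + 1) for c in range(k + 1)) for row in X]

residuals = [yi - sum(b * x for b, x in zip(beta_hat, row)) for yi, row in zip(y, X)]
mse = sum(e * e for e in residuals) / (n - (k + 1))

# Standardized residual: e_i / (s_e * sqrt(1 - h_ii)).
std_resid = [float(e) / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, leverages)]

high_leverage = [i + 1 for i, h in enumerate(leverages) if h > F(2 * (k + 1), n)]
large_residual = [i + 1 for i, r in enumerate(std_resid) if abs(r) > 2]

print([str(h) for h in leverages], sum(leverages))
print(high_leverage, large_residual)
```

With only six observations and three coefficients, the high-leverage cutoff 2(k + 1)/n equals 1, so no observation can be flagged; the rule of thumb is more informative in larger data sets such as the one in Example 12.31.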

More sophisticated tools for outlier detection in regression are also available. If the “influence” of 
an observation is defined in terms of the effect on the predicted values when the observation is 
omitted, then an influential observation is one that has both large leverage and a large residual. 
A popular measure that combines leverage and residual is Cook’s distance; consult the book by 
Kutner et al. for more information. Many statistical software packages will provide the standardized 
residual, leverage, and Cook’s distance for all n observations upon request; some will also flag 
observations with unusually high values (e.g., according to the criteria above). 


Another Perspective on Least Squares 

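We previously used multivariate calculus (in particular, the normal equations (12.13)) to determine
the least squares estimates of the regression coefficients. The matrix representation of regression
allows an alternative derivation of these estimates that relies instead on linear algebra.

Let $\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k$ denote the k + 1 columns of the design matrix $\mathbf{X}$. The principle of least squares
says we should determine coefficients $b_0, b_1, \ldots, b_k$ that minimize

\[ \sum (y_i - [b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}])^2 = \|\mathbf{y} - [b_0\mathbf{1} + b_1\mathbf{x}_1 + \cdots + b_k\mathbf{x}_k]\|^2 \]

The expression in brackets is a (generic) linear combination of the vectors $\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k$. Since $\|\cdot\|$
denotes Euclidean distance, we know from linear algebra that such a distance is minimized by finding
the projection of $\mathbf{y}$ onto the vector space (i.e., the closest vector to $\mathbf{y}$ in the space) spanned by $\mathbf{1}, \mathbf{x}_1,
\ldots, \mathbf{x}_k$. Call this projection vector $\mathbf{p}$. Since $\mathbf{p}$ lies in span$\{\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k\}$ it must have the form
$\mathbf{p} = b_0\mathbf{1} + b_1\mathbf{x}_1 + \cdots + b_k\mathbf{x}_k = \mathbf{X}\mathbf{b}$; our goal now is to find an explicit formula for the coefficients.

Use the property that if $\mathbf{p}$ is the projection of $\mathbf{y}$ onto span$\{\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k\}$, then the vector $\mathbf{y} - \mathbf{p}$ must
be orthogonal to that space. That is, the vector $\mathbf{y} - \mathbf{p}$ must be perpendicular to each of $\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k$,
meaning that $\mathbf{1}'(\mathbf{y} - \mathbf{p}) = 0$ and $\mathbf{x}_j'(\mathbf{y} - \mathbf{p}) = 0$ for j = 1, ..., k. In matrix form, these k + 1 requirements can be
written as $\mathbf{X}'(\mathbf{y} - \mathbf{p}) = \mathbf{0}$.

Put it all together: with $\mathbf{p} = \mathbf{X}\mathbf{b}$ and $\mathbf{X}'(\mathbf{y} - \mathbf{p}) = \mathbf{0}$,

\[
\mathbf{X}'(\mathbf{y} - \mathbf{p}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}'(\mathbf{y} - \mathbf{X}\mathbf{b}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\mathbf{b} \;\Rightarrow\; \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},
\]

matching the previous formula (12.16). Incidentally, the projection vector itself is $\mathbf{p} = \mathbf{X}\mathbf{b} = \mathbf{H}\mathbf{y} = \hat{\mathbf{y}}$
(the vector of fitted values), and the vector orthogonal to the space is $\mathbf{y} - \mathbf{p} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{e}$ (the vector of
residuals).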
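The orthogonality condition can be verified numerically. The following Python sketch checks, for the toy data of Example 12.28, that the residual vector is orthogonal to every column of the design matrix, and that shifting the coefficients away from the least squares solution increases the squared distance. The alternative coefficient vector `b_alt` is an arbitrary choice made here for illustration, and the exact-fraction form of the estimates is an assumption.

```python
from fractions import Fraction as F

sizes = [F(2), F(2), F(5, 2), F(5, 2), F(3), F(3)]
fuel = [0, 1, 0, 1, 0, 1]
y = [132, 167, 170, 204, 230, 260]
X = [[F(1), s, F(f)] for s, f in zip(sizes, fuel)]

b = [F(-737, 12), F(191, 2), F(33)]  # assumed exact least squares estimates

def residuals(coefs):
    """Return y - Xb for a given coefficient vector."""
    return [yi - sum(c * x for c, x in zip(coefs, row)) for yi, row in zip(y, X)]

e = residuals(b)

# X'e: inner product of each column of X with the residual vector.
Xt_e = [sum(row[j] * ei for row, ei in zip(X, e)) for j in range(3)]

sse = sum(r * r for r in e)

# Any other coefficient vector gives a vector in the column space farther from y.
b_alt = [b[0] + 1, b[1], b[2]]
sse_alt = sum(r * r for r in residuals(b_alt))

print(Xt_e, float(sse), float(sse_alt))
```

Every entry of `Xt_e` is zero, which is the statement $\mathbf{X}'\mathbf{e} = \mathbf{0}$, and the squared distance for `b_alt` exceeds the minimum by exactly n = 6 because the residuals sum to zero.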


Exercises: Section 12.9 (99-110) 


99. Consider fitting the model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ to the following data:

~ = -
-1 1 1
; e A

a. Determine $\mathbf{X}$ and $\mathbf{y}$, and express the normal equations in terms of matrices.
b. Determine the $\hat{\boldsymbol{\beta}}$ vector, which contains the estimates for the three coefficients in the model.
c. Determine $\hat{\mathbf{y}}$ and $\mathbf{e}$. Then calculate SSE, and use this to get the estimated variance MSE.
d. Use MSE and $(\mathbf{X}'\mathbf{X})^{-1}$ to construct a 95% confidence interval for $\beta_1$.
e. Carry out a t test for the hypothesis $H_0\colon \beta_1 = 0$ against a two-tailed alternative, and interpret the result.
f. Form the analysis of variance table, and carry out the F test for the hypothesis
$H_0\colon \beta_1 = \beta_2 = 0$. Find $R^2$ and interpret.

100. Consider the model $Y = \beta_0 + \beta_1 x_1 + \varepsilon$ for the following data:

at
|
in
WNNe <
ninnn
co 1 © Oo]

a. Determine the $\mathbf{X}$ and $\mathbf{y}$ matrices and express the normal equations in terms of matrices.
b. Determine the $\hat{\boldsymbol{\beta}}$ vector, which contains the estimates for the two coefficients in the model.
c. Determine $\hat{\mathbf{y}}$ and $\mathbf{e}$.
d. Calculate SSE (by summing the squared residuals) and then the estimated variance MSE.
e. Use MSE and $(\mathbf{X}'\mathbf{X})^{-1}$ to construct a 95% confidence interval for $\beta_1$.
f. Carry out a t test of $H_0\colon \beta_1 = 0$ against a two-sided alternative.
g. Carry out the F test of $H_0\colon \beta_1 = 0$. How is this related to part (f)?

101. Consider the simple linear regression model $Y = \beta_0 + \beta_1 x + \varepsilon$, so k = 1 and $\mathbf{X}$ consists
of a column of 1's and a column of the values $x_1, \ldots, x_n$ of x.

a. Determine $\mathbf{X}'\mathbf{X}$ and $(\mathbf{X}'\mathbf{X})^{-1}$ using the matrix inverse formula

\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]

b. Determine $\mathbf{X}'\mathbf{y}$, then calculate the coefficient vector $\hat{\boldsymbol{\beta}}$. Compare your
answers to the formulas given in Section 12.2. [Hint: $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$,
and similarly for $S_{xx}$.]

c. Use $(\mathbf{X}'\mathbf{X})^{-1}$ to obtain expressions for the variances of the coefficients, and
check your answers against the results given in Sections 12.3 and 12.4. [Note:
$\hat\beta_0$ is the predicted value corresponding to x* = 0, so the variance of $\hat\beta_0$ appears
implicitly in Section 12.4.]

102. Suppose we have bivariate data $(x_1, y_1), \ldots, (x_n, y_n)$. Consider the centered model
$Y_i = \beta_0 + \beta_1(x_i - \bar{x}) + \varepsilon_i$ for i = 1, ..., n.

a. Show that

\[ \mathbf{X}'\mathbf{X} = \begin{bmatrix} n & 0 \\ 0 & S_{xx} \end{bmatrix} \]

b. Determine $(\mathbf{X}'\mathbf{X})^{-1}$ and the coefficient vector $\hat{\boldsymbol{\beta}}$.

c. Determine the estimated standard errors of the regression coefficients.

d. Compare this exercise to the previous one. Why is it more efficient to have
$x_{i1} = x_i - \bar{x}$ rather than $x_{i1} = x_i$ in the design matrix?

103. Consider the model $Y_i = \beta_0 + \varepsilon_i$ (so k = 0). Estimate $\beta_0$ from Equation (12.16). Find a
simple expression for $s_{\hat\beta_0}$ and then the 95% confidence interval for $\beta_0$. [Note: Your
result should be equivalent to the one-sample t confidence interval in Section 8.3.]


104. a. Show that the normal equations are equivalent to $\mathbf{X}'\mathbf{e} = \mathbf{0}$. [Hint: Use the
matrix representation of the normal equations in this section and substitute
the formula for $\mathbf{b} = \hat{\boldsymbol{\beta}}$.]

b. Use part (a) to prove the ANOVA identity SST = SSE + SSR by showing
that $(\hat{\mathbf{y}} - \bar{\mathbf{y}})'\mathbf{e} = 0$. [Hint: Part (a) also implies that each row of $\mathbf{X}'$ is
orthogonal to $\mathbf{e}$; in particular, the first column of $\mathbf{X}$, a column of n 1's, satisfies
$\mathbf{1}'\mathbf{e} = 0$.]

105. Suppose that we have $Y_1, \ldots, Y_m \sim N(\mu_1, \sigma)$, $Y_{m+1}, \ldots, Y_{m+n} \sim N(\mu_2, \sigma)$, and all
m + n observations are independent. These are the assumptions of the pooled
t procedure in Section 10.2. Let k = 1, $x_{11} = .5, \ldots, x_{m1} = .5$, $x_{m+1,1} = -.5, \ldots,
x_{m+n,1} = -.5$. For convenience in inverting $\mathbf{X}'\mathbf{X}$ assume m = n.

a. Obtain $\hat\beta_0$ and $\hat\beta_1$ from Equation (12.16). [Hint: Let $\bar{y}_1$ be the mean
of the first m observations and $\bar{y}_2$ be the mean of the next n observations.]

b. Find simple expressions for $\hat{\mathbf{y}}$, SSE, $s_e$, and $s_{\hat\beta_1}$.

c. Use parts (a) and (b) to find a simple expression for the 95% CI for $\beta_1$. Show
that your formula is equivalent to

\[
\hat\beta_1 \pm t_{.025,m+n-2}\, s_e\sqrt{\frac{1}{m} + \frac{1}{n}}
= (\bar{y}_1 - \bar{y}_2) \pm t_{.025,m+n-2}\sqrt{\frac{\sum_{i=1}^{m}(y_i - \bar{y}_1)^2 + \sum_{i=m+1}^{m+n}(y_i - \bar{y}_2)^2}{m+n-2}}\,\sqrt{\frac{1}{m} + \frac{1}{n}}
\]

which is the pooled variance confidence interval discussed in Section 9.2.

d. Let m = 3 and n = 3, with $y_1 = 117$, $y_2 = 119$, $y_3 = 127$, $y_4 = 129$,
$y_5 = 138$, $y_6 = 139$. These are the prices in thousands for three houses in
Brookwood and then three houses in Pleasant Hills. Apply parts (a), (b), and
(c) to this data set.

106. The constant term $\beta_0$ is not always needed in the regression equation. For example,
many physical principles imply that the response variable should be 0 when the
explanatory variables are 0, so the constant term is not needed. Then it is preferable to
omit $\beta_0$ and use the model $Y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$. Here we
focus on the special case k = 1.

a. Differentiate the appropriate sum of squares to derive the one normal
equation for estimating $\beta_1$.

b. Express your normal equation in matrix form, where $\mathbf{X}$ consists of a single
column with the values of the predictor variable.

c. Apply part (b) to the data of Example 12.28, using hp for y and just engine
size in $\mathbf{X}$.

d. Explain why deletion of the constant term might be appropriate for the data
set in part (c).

e. By fitting a regression model with a constant term added to the model of
part (c), test the hypothesis that the constant is not needed.


107. Prove that the hat matrix $\mathbf{H}$ satisfies $\mathbf{H}^2 = \mathbf{H}$. Prove Equation (12.19). [Hint: Look at
the derivation of Equation (12.18).]

108. Use Eqs. (12.18) and (12.19) to show that each of the leverages is between 0 and 1,
and therefore the variances of the predicted values and residuals are between 0 and $\sigma^2$.

109. The measurements here are similar to those in Example 12.31, except that here the
students did the measuring at home, and the results suffered in accuracy.


Wingspan Foot Height 
74 13.0 75 
56 8.5 66 
65 10.0 69 
66 9.5 66 
62 9.0 54 
69 11.0 72 
75 12.0 75 
66 9.0 63 
66 9.0 66 
63 8.5 63 
a. Regress wingspan on the other two variables. Carry out the test of model
utility and the tests for the two individual regression coefficients of the predictors.

b. Obtain the diagonal elements of the hat matrix (leverages). Identify the point
with the highest leverage. What is unusual about the point? Given the
instructor's assertion that there were no students in the class less than five feet
tall, would you say that there was an error? Give another reason that this
student's measurements seem wrong.

c. For the other points with high leverages, what distinguishes them from the
points with ordinary leverage values?

d. Examining the residuals, find another student whose data might be wrong.

e. Discuss the elimination of questionable points in order to obtain valid
regression results.


110. Refer back to the centered simple regression model in Exercise 102.

a. Show that the leverage $h_{ii}$, the ith diagonal entry of the hat matrix $\mathbf{H}$, is
given by

\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}} \]

b. Show that the sum of the leverages in simple regression is 2.

c. Use part (a) and the discussion of $\mathbf{H}$ in this section to confirm the following
formulas from Sections 12.4 and 12.6:


All of the regression models thus far have assumed a quantitative response variable y. (Section 12.8
discussed how to incorporate a categorical explanatory variable using one or more indicators, but
y was still numerical.) In this final section, we describe procedures for modeling the relationship
between a categorical response variable and one or more predictors. For example, university
administrators may wish to predict whether a student will graduate (a yes-or-no variable) as a function
of high school GPA, SAT scores, and number of extracurricular activities. Medical researchers
frequently construct models to determine the effect of treatment dosage and other factors (age, weight,
and so on) on whether or not someone contracts a certain disease.


The Simple Logistic Regression Model

The simple linear regression model is appropriate for relating a quantitative response variable y to a
quantitative predictor x. But suppose we have a dichotomous categorical response variable, whose
“values” are success and failure. We can encode this with a Bernoulli rv Y, with possible values 1 and
0 corresponding to success and failure. As in previous chapters, let p = P(S) = P(Y = 1) and
1 − p = P(F) = P(Y = 0). Frequently, the value of p will depend on the value of some quantitative
variable x. For example, the probability that a car needs warranty service should depend on the
car's mileage, or the probability of avoiding an infection might depend on the dosage in an
inoculation. Instead of using just the symbol p for the success probability, we now use p(x) to
emphasize the dependence of this probability on the value of x. The simple linear regression equation
$Y = \beta_0 + \beta_1 x + \varepsilon$ is no longer appropriate, for taking the mean value on each side of that equation
would give

\[ \mu_{Y \mid x} = 1 \cdot p(x) + 0 \cdot [1 - p(x)] = p(x) = \beta_0 + \beta_1 x \]

Whereas p(x) is a probability and therefore must be between 0 and 1, $\beta_0 + \beta_1 x$ need not be in this
range.

Instead of letting the mean value of y be a linear function of x, we now consider a model in which 
the mean response p(x) is a particular nonlinear function of x. A function that has been found quite 
useful in many applications is the logit function 


\[ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \qquad (12.20) \]

It is easy to see that the logit function is bounded between 0 and 1, so 0 < p(x) < 1 as desired.
Figure 12.36 shows a graph of p(x) for particular values of \(\beta_0\) and \(\beta_1\) with \(\beta_1 > 0\). As x increases, the
probability of success increases. For \(\beta_1 < 0\), the success probability would be a decreasing function of x.


Figure 12.36 A graph of a logit function
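To make the contrast between the linear form and the logit form concrete, here is a minimal Python sketch that evaluates both at a few x values. The coefficients b0 = -0.2 and b1 = 0.05 are hypothetical values chosen only for illustration; they do not come from the text.

```python
import math

def logit_p(x, b0, b1):
    """Success probability p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), as in (12.20)."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, for illustration only (not from the text)
b0, b1 = -0.2, 0.05

for x in (-10, 0, 10, 20, 40):
    linear = b0 + b1 * x        # linear form: can fall outside [0, 1]
    prob = logit_p(x, b0, b1)   # logit form: always strictly between 0 and 1
    print(f"x={x:>4}  b0+b1*x={linear:6.2f}  p(x)={prob:.4f}")
```

With a positive slope, the printed p(x) values rise with x but never reach 0 or 1, whereas the linear expression is negative at x = -10 and exceeds 1 at x = 40.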
Logistic regression means assuming that p(x) is related to x by the logit function. Straightforward
algebra shows that

\[ \frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x} \]


The expression on the left-hand side is called the odds. If, for example, p(60) = 3/4 = .75, then
p(60)/(1 − p(60)) = .75/(1 − .75) = 3 and when x = 60 a success is three times as likely as a
failure. This is described by saying that the odds are 3 to 1 because the success probability is three
times the failure probability. Taking natural logs of both sides, we see that the logarithm of the odds is
a linear function of the predictor:

\[ \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x \]

In particular, the slope parameter \(\beta_1\) is the change in the log-odds associated with a one-unit increase
in x. This implies that the odds itself changes by the multiplicative factor \(e^{\beta_1}\) when x increases by one
unit. The quantity \(e^{\beta_1}\) is called the odds ratio, because it represents the ratio of the odds of success
when the predictor variable equals x + 1 to the odds of success when the predictor variable equals x.
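The odds-ratio claim can be checked in one line from the expression for the odds. The derivation below uses only symbols already defined in this section; "odds(x)" is shorthand introduced here for p(x)/[1 − p(x)].

```latex
\[
\frac{\text{odds}(x+1)}{\text{odds}(x)}
  = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
  = e^{\beta_1}
\]
```

Because x cancels, the ratio is the same at every value of x, which is why a single number summarizes the effect of a one-unit increase on the odds.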


Example 12.32 It seems reasonable that the size of a cancerous tumor should be related to the 
likelihood that the cancer will spread (metastasize) to another site. The article “Molecular Detection 
of p16 Promoter Methylation in the Serum of Patients with Esophageal Squamous Cell Carcinoma” 
(Cancer Res. 2001: 3135-3138) investigated the spread of esophageal cancer to the lymph nodes. 
With x = size of a tumor (cm) and Y = 1 if the cancer does spread, consider the logistic regression 
model with \(\beta_1 = .5\) and \(\beta_0 = -2\) (values suggested by data in the article). Then

\[ p(x) = \frac{e^{-2 + .5x}}{1 + e^{-2 + .5x}} \]

from which p(2) = .27 and p(8) = .88 (tumor sizes for patients in the study ranged from 1.7 to 9.0 cm).
Because \(e^{-2 + .5(6.77)} \approx 4\), the odds for a 6.77 cm tumor are 4, so that it is four times as likely as not that
a tumor of this size will spread to the lymph nodes. Finally, for every 1-cm increase in tumor size, the
odds of metastasis increase by a multiplicative factor of \(e^{.5} \approx 1.65\), or 65%. Be careful here: the
probability of metastasis is not increasing by 65%, but rather the odds; under the logistic regression
model, the probability of an outcome does not increase linearly with x (see Figure 12.36). ■
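The arithmetic in Example 12.32 can be reproduced with a few lines of Python. This is only a sketch: the function names p and odds are chosen here for illustration, and the coefficients are the values stated in the example.

```python
import math

def p(x, b0=-2.0, b1=0.5):
    """Logit function (12.20) with the Example 12.32 values b0 = -2, b1 = .5."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(x, b0=-2.0, b1=0.5):
    """Odds p(x) / (1 - p(x)), which simplify to e^(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

print(round(p(2), 2), round(p(8), 2))   # .27 and .88, as in the example
print(round(odds(6.77), 2))             # close to 4
print(round(odds(7) / odds(6), 2))      # odds ratio e^.5, about 1.65
# The change in probability per 1 cm differs from place to place on the curve
print(round(p(3) - p(2), 3), round(p(8) - p(7), 3))
```

The last line illustrates the caution at the end of the example: the odds are multiplied by the same factor for every 1-cm increase, but the probability itself does not change by a constant amount.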


Fitting the Simple Logistic Regression Model 

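Fitting the logit model (12.20) to sample data requires that the parameters \(\beta_0\) and \(\beta_1\) be estimated.
Rather than apply the principle of least squares from linear regression, the standard way to estimate
logistic regression parameters is by the method of maximum likelihood. Suppose, for example, that
n = 5 and that the observations made at \(x_2\), \(x_4\), and \(x_5\) are successes whereas the other two
observations are failures. Then the likelihood function is

\[
\begin{aligned}
L(\beta_0, \beta_1) &= P(Y_1 = 0, Y_2 = 1, Y_3 = 0, Y_4 = 1, Y_5 = 1) \\
&= [1 - p(x_1)]\, p(x_2)\, [1 - p(x_3)]\, p(x_4)\, p(x_5) \\
&= \frac{1}{1 + e^{\beta_0 + \beta_1 x_1}} \cdot \frac{e^{\beta_0 + \beta_1 x_2}}{1 + e^{\beta_0 + \beta_1 x_2}} \cdot \frac{1}{1 + e^{\beta_0 + \beta_1 x_3}} \cdot \frac{e^{\beta_0 + \beta_1 x_4}}{1 + e^{\beta_0 + \beta_1 x_4}} \cdot \frac{e^{\beta_0 + \beta_1 x_5}}{1 + e^{\beta_0 + \beta_1 x_5}}
\end{aligned}
\]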




Unfortunately it is not at all straightforward to maximize this likelihood, and there are no nice 
formulas for the mles \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The maximization process must be carried out using iterative
numerical methods. The details are involved, but fortunately the most popular statistical software
packages will do this on request and provide both quantitative and graphical indications of how well
the model fits.

In particular, the mle \(\hat{\beta}_1\) is typically provided along with its estimated standard deviation \(s_{\hat{\beta}_1}\). For
large n, the mle has approximately a normal distribution and the standardized variable \((\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}\)
has approximately a standard normal distribution. This allows for calculation of a confidence interval
for \(\beta_1\) as well as for testing \(H_0\colon \beta_1 = 0\), according to which the value of x has no impact on the
likelihood of success.
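To give a feel for what "iterative numerical methods" means here, the following Python sketch maximizes the log-likelihood by Newton-Raphson and obtains standard errors from the inverse information matrix. Newton-Raphson is one common choice; the text does not say which algorithm any particular package uses, so treat this as an illustration rather than a description of Minitab or JMP. The data are the job-experience observations listed in Example 12.33, and the function name fit_simple_logistic is made up for this sketch.

```python
import math

# Data as listed in Example 12.33: x = months of experience, y = 1 if the task was completed
x = [4, 5, 6, 6, 7, 8, 9, 10, 11, 11, 13, 13, 14, 15, 18,
     18, 19, 20, 20, 21, 21, 22, 23, 25, 26, 27, 28, 29, 30, 32]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
     0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]

def information(x, b0, b1):
    """Entries of the 2x2 information matrix and its determinant at (b0, b1)."""
    w = []
    for xi in x:
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w.append(pi * (1.0 - pi))
    i00 = sum(w)
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return i00, i01, i11, i00 * i11 - i01 ** 2

def fit_simple_logistic(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the simple logit model; returns (b0, b1, se_b0, se_b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: first partial derivatives of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        i00, i01, i11, det = information(x, b0, b1)
        # Newton step: solve (information matrix) * step = score for the 2x2 case
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information matrix
    i00, i01, i11, det = information(x, b0, b1)
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

b0_hat, b1_hat, se_b0, se_b1 = fit_simple_logistic(x, y)
print(round(b0_hat, 3), round(b1_hat, 4), round(se_b0, 3), round(se_b1, 4))
```

The printed estimates and standard errors should agree, up to rounding, with the Coef and SE Coef columns of the Minitab output in Figure 12.37.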


Example 12.33 The following data resulted from a study commissioned by a large management 
consulting company to investigate the relationship between amount of job experience (x, in months) 
for a junior consultant and the likelihood of the consultant being able to perform a certain complex 
task. The value y = 1 indicates the consultant completed the task (success), whereas y = 0
corresponds to failure.

x    4   5   6   6   7   8   9  10  11  11  13  13  14  15  18
y    0   0   0   0   0   1   0   0   0   0   1   0   1   0   1

x   18  19  20  20  21  21  22  23  25  26  27  28  29  30  32
y    0   0   1   0   1   1   1   0   1   1   0   1   1   1   1

Figure 12.37 shows Minitab output for a logistic regression analysis. The estimates of the
parameters \(\beta_0\) and \(\beta_1\) are \(\hat{\beta}_0 = -3.21107\) and \(\hat{\beta}_1 = 0.177717\), respectively. The resulting estimated
logistic regression function, denoted \(\hat{p}(x)\), is

\[ \hat{p}(x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}} = \frac{e^{-3.211 + 0.1777x}}{1 + e^{-3.211 + 0.1777x}} \]

The graph of \(\hat{p}(x)\) is the curve shown in Figure 12.38; notice that the (estimated) probability of
success increases as x increases. Remember that the logit curve is modeling the mean y value for each
x value; we do not anticipate that it will intersect the points in the scatterplot.


Binary Logistic Regression: Success versus Months 


Logistic Regression Table 


Odds 95% CI 
Predictor Coef SE Coef Z P Ratio Lower Upper 
Constant -3.21107 1.23540 -2.60 0.009 
Months 0.177717 0.0657308 2.70 0.007 1.19 1.05 1.36 


Figure 12.37 Logistic regression output from Minitab 
Figure 12.38 Scatterplot with the fitted logistic regression function for Example 12.33


We may use \(\hat{p}(x)\) to estimate the likelihood of a junior consultant completing the complex task,
based upon her/his duration of job experience. For example,

\[ \hat{p}(12) = \frac{e^{-3.211 + 0.1777(12)}}{1 + e^{-3.211 + 0.1777(12)}} = .254 \quad \text{and} \quad \hat{p}(24) = \frac{e^{-3.211 + 0.1777(24)}}{1 + e^{-3.211 + 0.1777(24)}} = .742 \]


So, it is estimated that a consultant with just one year (12 months) of experience has about a .25 
chance of successfully completing the task, compared to a probability of over .74 for someone with 
two years’ experience. 
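These two estimated probabilities are easy to reproduce. The sketch below uses the rounded coefficients −3.211 and 0.1777 from the example; the function name p_hat is chosen here for illustration.

```python
import math

def p_hat(x, b0=-3.211, b1=0.1777):
    """Estimated logistic regression function from Example 12.33 (rounded estimates)."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

print(round(p_hat(12), 3))   # about .254
print(round(p_hat(24), 3))   # about .742
```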

The Minitab output includes \(s_{\hat{\beta}_1}\) under SE Coef. For the “utility test” of \(H_0\colon \beta_1 = 0\) versus
\(H_a\colon \beta_1 \neq 0\), the test statistic value and two-tailed P-value are

\[ z = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{.177717}{.0657308} = 2.70 \qquad P\text{-value} = 2P(Z > 2.70) = 2[1 - \Phi(2.70)] = .007 \]

The null hypothesis is rejected at the .05 or .01 level, and we conclude that months of experience is a
useful predictor of a junior consultant’s ability to complete the task.

The estimated odds ratio is \(e^{\hat{\beta}_1} = e^{.1777} = 1.19\). A 95% CI for \(\beta_1\) is given by

\[ \hat{\beta}_1 \pm z_{.025}\, s_{\hat{\beta}_1} = .177717 \pm 1.96(.0657308) = (.04888, .30655) \]

from which a 95% CI for the true odds ratio, \(e^{\beta_1}\), is \((e^{.04888}, e^{.30655}) = (1.05, 1.36)\). The estimated
odds ratio and the CI all appear in the output. With a high degree of confidence, for each additional
month of experience, the odds that a consultant can successfully complete the task increase by a
multiplicative factor of between 1.05 and 1.36, i.e., increase by 5–36%. ■
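The test statistic, P-value, and both confidence intervals can be recomputed from just the two numbers in the Months row of Figure 12.37. The sketch below uses only the Python standard library; std_normal_cdf is a helper name introduced here, written in terms of the error function.

```python
import math

b1_hat = 0.177717    # estimated slope (Coef) from Figure 12.37
se_b1 = 0.0657308    # its estimated standard error (SE Coef)

def std_normal_cdf(z):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = (b1_hat - 0.0) / se_b1
p_value = 2.0 * (1.0 - std_normal_cdf(abs(z)))   # two-tailed P-value

z_crit = 1.96                                    # z critical value for 95% confidence
lower = b1_hat - z_crit * se_b1
upper = b1_hat + z_crit * se_b1

print(round(z, 2), round(p_value, 3))            # test statistic and P-value
print(round(lower, 5), round(upper, 5))          # 95% CI for the slope
# Exponentiating the estimate and the endpoints gives the odds ratio and its CI
print(round(math.exp(b1_hat), 2), round(math.exp(lower), 2), round(math.exp(upper), 2))
```

Exponentiating the endpoints works because the exponential function is increasing, so an interval for the slope maps directly to an interval for the odds ratio.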
Some software packages report the value of the chi-squared statistic \(z^2\) rather than z itself, along
with the corresponding P-value for a two-tailed test.


Example 12.34 Here is data on launch temperature (°F) and the incidence of failure for O-rings in 
23 space shuttle launches prior to the Challenger disaster of January 28, 1986. 


Temperature  Failure     Temperature  Failure     Temperature  Failure
     53         Y             68         N             75         N
     57         Y             69         N             75         Y
     58         Y             70         N             76         N
     63         Y             70         N             76         N
     66         N             70         Y             78         N
     67         N             70         Y             79         N
     67         N             72         N             81         N
     67         N             73         N


Figure 12.39 shows JMP output from a logistic regression analysis. We have chosen to let p 
denote the probability of an O-ring failure, since this is really the event of interest. Failures tended to 
occur at lower temperatures and successes at higher temperatures, so the graph of p(x) decreases as 
temperature (x) increases. 


[Plot: fitted curve on a 0.00 to 1.00 vertical scale versus temp, 50 to 85]
Parameter Estimates 
Term Estimate Std Error ChiSquare Prob>ChiSq 
Intercept 15.0422911 7.378391 4.16 0.0415 
temp —0.2321537 0.1082329 4.60 0.0320 


Figure 12.39 Logistic regression output from JMP 


The estimate of \(\beta_1\) is \(\hat{\beta}_1 = -.2322\), and the estimated standard deviation of \(\hat{\beta}_1\) is \(s_{\hat{\beta}_1} = .1082\). The
value of z for testing \(H_0\colon \beta_1 = 0\), which asserts that temperature does not affect the likelihood of
O-ring failure, is \(z = \hat{\beta}_1/s_{\hat{\beta}_1} = -.2322/.1082 = -2.15\). The P-value is \(2P(Z < -2.15) =
2(.0158) = .032\). JMP reports the value of a chi-squared statistic computed as \((-2.15)^2 = 4.60\) (there
is a slight disparity due to rounding in the z value). Either way, the P-value indicates that \(H_0\) should
be rejected at the .05 level and, hence, that temperature at launch time has a statistically significant
effect on the likelihood of an O-ring failure. Specifically, for each 1 °F increase in launch temperature,
we estimate that the odds of failure are multiplied by a factor of \(e^{\hat{\beta}_1} = e^{-.2322} \approx .79\), i.e., the
odds are estimated to decrease by 21%.
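The relationship between the z statistic and the chi-squared statistic that JMP reports can be verified directly from the temp row of Figure 12.39. This is a small sketch using only the standard library; it relies on the fact that a chi-squared variable with 1 df is the square of a standard normal variable, so the two tests give the same P-value.

```python
import math

b1_hat = -0.2321537   # temp coefficient from the JMP output in Figure 12.39
se_b1 = 0.1082329     # its standard error

z = b1_hat / se_b1
chi_sq = z ** 2                                   # the value reported under ChiSquare
# For 1 df: P(chi-squared > z^2) = P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(round(z, 3), round(chi_sq, 2), round(p_value, 4))
print(round(math.exp(b1_hat), 2))                 # multiplicative change in odds per 1 °F
```

Working with the unrounded estimate and standard error removes the small rounding disparity mentioned in the example.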
The launch temperature for the Challenger mission was only 31 °F. Because this value is much
smaller than any temperature in the sample, it is dangerous to extrapolate the estimated relationship.
Nevertheless, it appears that for a temperature this small, O-ring failure is almost a sure thing.
The logistic regression gives the estimated probability at x = 31 as

\[ \hat{p}(31) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1(31)}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1(31)}} = \frac{e^{15.0423 - .23215(31)}}{1 + e^{15.0423 - .23215(31)}} = .99961 \]

and the odds associated with this probability are .99961/(1 − .99961) ≈ 2563. Thus, if the logistic
regression can be extrapolated down to 31 °F, the probability of failure is .99961, the probability of
success is .00039, and the predicted odds are 2563 to 1 against avoiding an O-ring failure. ■
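The extrapolated probability and odds at 31 °F can be recomputed as follows. This sketch uses the rounded coefficients 15.0423 and −.23215 that appear in the calculation above.

```python
import math

b0_hat, b1_hat = 15.0423, -0.23215    # rounded estimates used in the calculation above

eta = b0_hat + b1_hat * 31             # estimated log-odds of failure at 31 °F
p_fail = math.exp(eta) / (1.0 + math.exp(eta))
odds_fail = math.exp(eta)              # algebraically equal to p_fail / (1 - p_fail)

print(round(p_fail, 5))                # about .99961
print(round(odds_fail))                # a little over 2550
```

Computing the odds directly as the exponential of the log-odds gives a value near 2555 rather than 2563. The difference arises because the figure 2563 is obtained from the probability after rounding to five decimal places, and a ratio with a denominator as small as .00039 is very sensitive to that rounding. The qualitative conclusion is unchanged.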


Multiple Logistic Regression 

Multiple logistic regression, a natural extension of simple logistic regression, postulates a model for 
relating a categorical response variable to more than one explanatory variable. The explanatory 
variables themselves may be true quantitative predictors or indicator variables coding categorical 
predictors. We continue to restrict attention to a binary response, such as yes/no or happy/sad, which 
may be coded as 1 or 0 (with 1 indicating the event of interest, i.e., a “success”).

With predictors \(x_1, \ldots, x_k\) in the model, let \(p(x_1, \ldots, x_k)\) denote the true probability of the event of
interest occurring and assume the following multiple logit function applies:

\[ p(x_1, \ldots, x_k) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}} \qquad (12.21) \]

The multiple logit function (12.21) is the obvious extension of the simple logit function (12.20) to
accommodate more than one explanatory variable. As in simple logistic regression, this logit function
can be rewritten in terms of the natural log of the odds of the event of interest:

\[ \ln\left(\frac{p(x_1, \ldots, x_k)}{1 - p(x_1, \ldots, x_k)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]

Written this way, the coefficient \(\beta_j\) (j = 1, ..., k) is interpreted as the change in the log-odds of the
event of interest associated with a one-unit increase in \(x_j\), after adjusting for the effects of all the other
predictors in the model. Equivalently, \(e^{\beta_j}\) is the multiplicative change in odds associated with a
one-unit increase in \(x_j\) after accounting for the other k − 1 predictors, i.e., \(e^{\beta_j}\) is the odds ratio associated
with \(x_j\).

Inference procedures in multiple logistic regression are similar to those outlined for simple logistic 
regression. In particular, the point estimators \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) for the unknown \(\beta_j\)’s are based upon the
principle of maximum likelihood, and each estimator has approximately a normal sampling distribution
provided the sample size n is reasonably large. Several statistical software packages will
provide point estimates and estimated standard errors for the coefficients, allowing for variable utility 
hypothesis tests as well as confidence intervals. 


Example 12.35 The authors of the article “Building Social Capital in Forest Communities: Analysis
of New Mexico’s Collaborative Forest Restoration Program” (Natural Resour. J., Fall 2007:
867-915) analyzed the factors that helped determine which proposals were funded by the
Collaborative Forest Restoration Program (CFRP, a federally funded grant program). Data was
available on 219 proposals made in New Mexico between 2001 and 2006. The response variable of
interest is

\[ y = \begin{cases} 1 & \text{if the grant proposal was funded} \\ 0 & \text{if the grant proposal was not funded} \end{cases} \]

We will consider just a few of the predictor variables the authors used in their analysis: \(x_1\) = amount
of funding requested by the project (in $1000), \(x_2\) = percent of county residents living below the
poverty threshold, and \(x_3\) = 1 if the proposed treatment of private lands was cited by the review panel
as a weakness of the project (\(x_3\) = 0 otherwise).


Parameter estimates from software are \(\hat{\beta}_0 = -1.216\), \(\hat{\beta}_1 = .00156\), \(\hat{\beta}_2 = .0327\), and \(\hat{\beta}_3 = -2.002\).
Consider a proposal requesting $360,000 (\(x_1\) = 360) that was not criticized for its proposed treatment
of private lands (\(x_3\) = 0) in a county with a 16.6% poverty rate (\(x_2\) = 16.6); this exactly matches one
of the proposals. Then the estimated log-odds of the project being funded are

−1.216 + .00156(360) + .0327(16.6) − 2.002(0) = −.11158

and the estimated probability of being funded is

\[ \hat{p} = \frac{e^{-.11158}}{1 + e^{-.11158}} = .472 \]

(For the record, that particular proposal was funded!)

Funding request amount had little practical impact on whether a project was funded: adjusting for
poverty rate and private land use consideration, a $1000 (one-unit) increase in requested funding
actually increased the estimated odds of acceptance by a factor of \(e^{.00156} = 1.0016\), i.e., by .16%. In contrast,
criticism for private land treatment was a veritable death-knell: removing the effects of the other two
variables, odds of acceptance when \(x_3\) = 1 are \(e^{-2.002} = .1351\) times the acceptance odds when
\(x_3\) = 0. In other words, if a proposal was criticized in this way, the odds of acceptance were reduced
by more than 86%. ■
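The calculations in Example 12.35 can be reproduced with a short Python sketch. The coefficient values are those reported in the example; the dictionary keys and the function name log_odds are labels invented here for readability.

```python
import math

# Estimated coefficients reported in Example 12.35
b = {"intercept": -1.216, "request": 0.00156, "poverty": 0.0327, "private_land": -2.002}

def log_odds(request_k, poverty_pct, private_land_flag):
    """Estimated log-odds of funding: request in $1000, poverty in percent, flag 0 or 1."""
    return (b["intercept"] + b["request"] * request_k
            + b["poverty"] * poverty_pct + b["private_land"] * private_land_flag)

eta = log_odds(360, 16.6, 0)
p_funded = math.exp(eta) / (1.0 + math.exp(eta))

print(round(eta, 5))                            # estimated log-odds, -.11158
print(round(p_funded, 3))                       # estimated probability of funding
print(round(math.exp(b["request"]), 4))         # odds ratio per extra $1000 requested
print(round(math.exp(b["private_land"]), 4))    # odds ratio for private-land criticism
```

Changing the third argument of log_odds from 0 to 1 shows how sharply the estimated probability drops for an otherwise identical proposal that was criticized.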


A model utility test of \(H_0\colon \beta_1 = \cdots = \beta_k = 0\) versus \(H_a\): not all \(\beta\)’s are zero is based on the
likelihood ratio test statistic \(\Lambda\) presented in Section 9.5; most statistical software packages will
include the test statistic value and P-value when multiple logistic regression is performed. (The test is
based on the large-sample approximation mentioned at the end of Section 9.5, whereby \(-2\ln(\Lambda)\) has
approximately a chi-squared distribution with k df.)
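As a small illustration of the likelihood ratio statistic, the sketch below computes −2ln(Λ) for the simple (k = 1) model of Example 12.33, using the data and fitted coefficients reported there. This special case is chosen because the chi-squared tail area with 1 df can be written with the standard library's erfc function; for k > 1 a general chi-squared cdf from a statistics library would be needed. The function name log_lik is introduced here for illustration.

```python
import math

# Data from Example 12.33 and the fitted coefficients reported in Figure 12.37
x = [4, 5, 6, 6, 7, 8, 9, 10, 11, 11, 13, 13, 14, 15, 18,
     18, 19, 20, 20, 21, 21, 22, 23, 25, 26, 27, 28, 29, 30, 32]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
     0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
b0_hat, b1_hat = -3.21107, 0.177717

def log_lik(b0, b1):
    """Bernoulli log-likelihood of the simple logit model at (b0, b1)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

# Under H0 the success probability is constant, and its mle is the sample proportion
p_null = sum(y) / len(y)
b0_null = math.log(p_null / (1.0 - p_null))

g = 2.0 * (log_lik(b0_hat, b1_hat) - log_lik(b0_null, 0.0))   # equals -2 ln(Lambda)
p_value = math.erfc(math.sqrt(g / 2.0))                       # chi-squared tail area, 1 df

print(round(g, 2), round(p_value, 4))
```

The likelihood ratio statistic need not equal the squared z statistic from the utility test, although the two tests usually lead to the same conclusion in large samples.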

The logit functions in (12.20) and (12.21) are not the only choices for modeling the probability of 
success. Two other popular options are the probit and complementary log-log functions, both of
which are implemented in many software packages. The relative suitability of these functions to 
fitting a particular data set can be assessed using various automated “goodness-of-fit” procedures, 
including the deviance test and the Hosmer-Lemeshow test. Consult the text by Kutner et al. listed in 
the bibliography for more information. 
Exercises: Section 12.10 (111-120)

111. A major electronics retailer sensibly believes that customers are more likely to
redeem an emailed coupon if it’s worth more money. With x = coupon discount
amount ($), and Y = 1 if a customer redeems the coupon, consider a logistic
regression model with \(\beta_0 = -3.75\) and \(\beta_1 = 0.1\).
a. Calculate and interpret both p(10) and p(50).
b. Calculate the odds that a $10 coupon is redeemed, then repeat for $50.
c. Interpret \(\beta_1\) in this context.
d. According to this model, for what discount amount is there a 50-50 chance
the coupon will be redeemed?

112. In Example 12.32, the probability of cancer metastasizing was given by the logistic
regression model with \(\beta_0 = -2\) and \(\beta_1 = 0.5\).
a. Tabulate values of x, p(x), the odds p(x)/[1 − p(x)], and the log-odds for
x = 2, 3, 4, ..., 9. (In the cited article, tumor sizes ranged from 1.7 to 9.0 cm.)
b. Explain what happens to the odds when x is increased by 1. Your explanation
should involve the .5 that appears in the formula for p(x).
c. Support your answer to (b) algebraically, starting from the formula for p(x).
d. For what value of x are the odds 1? 5? 10?

113. Adolescents are getting less sleep than ever before, and this can have serious behavioral
repercussions. The article “Dose-Dependent Associations Between Sleep Duration and
Unsafe Behaviors Among US High-School Students” (JAMA Pediatr. 2018: 1187-1189)
reported a large-scale study of American teenagers. The investigators fit a
simple logistic regression model with the response variable y = 1 if a teenager had
driven drunk in the last 30 days (and 0 otherwise), and x = typical number of hours
of sleep per night. Information in the article suggests \(\hat{\beta}_1 = -.1998\) and \(s_{\hat{\beta}_1} = .0986\).
a. Test whether sleep has an effect on the likelihood of driving drunk among
American teenagers, at the .05 significance level.
b. Calculate a 95% confidence interval for \(e^{\beta_1}\).
c. Interpret the confidence interval from part (b) in terms of a one-hour decrease
in sleep.


114. The pharmaceutical industry has increasingly developed “nanoformulations” for
drug delivery, but quality control at such a small scale is tricky. The article “Quality by
Design Approach Using Multiple Linear and Logistic Regression Modeling Enables
Microemulsion Scale Up” (Molecules 2019) describes one study to determine
how x = oil concentration (g/100 mL) affects whether a development run meets a
certain critical quality attribute (CQA) with respect to polydispersity. Here, y = 1 if the
CQA was achieved and y = 0 if not.


60 60 20 40 60 20 2.0 
0 0 1 1 0 1 1 1 1 0 
20 20 60 20 40 60 2.0 
1 1 0 1 0 0 1 0 1 1 
45 20 20 60 20 20 2.0 
0 1 1 0 1 1 1 1 0 1 


Software reports coefficients \(\hat{\beta}_0 = 11.13\) and \(\hat{\beta}_1 = -2.68\) with estimated standard
errors 5.96 and 1.42, respectively.
a. Write out the estimated logit function, and use it to estimate p(2), p(4), and p(6).
b. Does the data provide convincing statistical evidence that oil concentration
affects the chance of meeting this particular CQA? Test at the .05 significance level.
c. Construct a 95% CI for \(\beta_1\), then use this to give an interval estimate for \(e^{\beta_1}\).
Interpret the latter interval.
d. What does \(e^{\hat{\beta}_0}\) represent in this context? Does that interpretation seem
appropriate here? Why or why not?

115. Kyphosis, or severe forward flexion of the spine, may persist despite corrective spinal
surgery. A study carried out to determine risk factors for kyphosis reported the following
ages (months) for 40 subjects at the time of the operation; the first 18 subjects
did have kyphosis and the remaining 22 did not.

Kyphosis 12 15 42 52 59 73

No kyphosis 1 1 2 8 11 18

a. Use software to fit a logistic regression model to this data.
b. Interpret the coefficient \(\hat{\beta}_1\). [Hint: It might be more sensible to work in
terms of \(e^{\hat{\beta}_1}\).]
c. Test whether age has a statistically significant impact on the presence of kyphosis.

116. Exercise 16 of Chapter 1 presented data on the noise level (dBA) for 77 individuals
working at a particular office. In fact, each person was also asked whether the noise
level in the office was acceptable or not.

Acceptable
55.3  55.3  55.3  55.9  55.9  55.9  55.9  56.1
56.1  56.1  56.1  56.1  56.1  56.8  56.8  57.0
57.0  57.0  57.8  57.8  57.8  57.9  57.9  57.9
58.8  58.8  58.8  59.8  59.8  59.8  62.2  62.2
65.3  65.3  65.3  65.3  68.7  69.0  73.0  73.0

Unacceptable
63.8  63.8  63.8  63.9  63.9  63.9  64.7  64.7
64.7  65.1  65.1  65.1  67.4  67.4  67.4  67.4
68.7  68.7  68.7  70.4  70.4  71.2  71.2  73.1
73.1  74.6  74.6  74.6  74.6  79.3  79.3  79.3
79.3  79.3  83.0  83.0  83.0

a. Use software to fit a logistic regression model to this data.


b. Interpret the coefficient \(\hat{\beta}_1\). [Hint: It might be more sensible to work in
terms of \(e^{\hat{\beta}_1}\).]
c. Construct and interpret a 95% confidence interval for \(e^{\beta_1}\).

117. The article “Consumer Attitudes Toward Genetic Modification and Other Possible
Production Attributes for Chicken” (J. Food Distr. Res. 2005: 1-11) reported a survey of
498 randomly selected consumers concerning their views on genetically modified
(GM) food. The researchers’ goal was to model the response variable Y = 1 if a consumer
wants GM chicken products labeled (and 0 otherwise) as a function of \(x_1\) = consumer’s
age (yr), \(x_2\) = income ($1000s), sex (\(x_3\) = 1 if female), and whether there are
children in the consumer’s household (\(x_4\) = 1 if yes). Estimated model parameter
values are \(\hat{\beta}_0 = .8247\), \(\hat{\beta}_1 = .0073\), \(\hat{\beta}_2 = .0041\), \(\hat{\beta}_3 = .9910\), and \(\hat{\beta}_4 = .0224\).
a. Estimate the likelihood that a consumer wants GM chicken products labeled if
that person is a 35-year-old female with $65,000 annual income and no children.
b. Repeat part (a) for a 35-year-old male (keep other features the same).
c. Interpret the coefficient on age.
d. Interpret the coefficient on the sex indicator variable.

118. Road trauma is the leading cause of death and injury among young people in Australia.
The article “The Journey from Traffic Offender to Severe Road Trauma Victim:
Destiny or Preventive Opportunity?” (PLoS ONE, April 22, 2015) reported a study to
determine factors that might help predict future serious accidents. The article included
estimated odds ratios and 95% CIs for true odds ratios for several variables. The
response variable here has value y = 1 if a subject was in an accident leading to
intensive care admission or death, and 0 otherwise.

                                            Est. OR   OR 95% CI
\(x_1\) = age/10                            1.02      (1.01, 1.03)
\(x_2\) = 1 if male, 0 if female            1.18      (0.98, 1.42)
\(x_3\) = years with a driver’s license     0.99      (0.98, 0.99)
\(x_4\) = number of prior traffic offenses  1.10      (1.08, 1.11)

a. Which of these four explanatory variables were associated with a decreased
likelihood of later severe road trauma? How can you tell?
b. Which of these four explanatory variables were not statistically significant
predictors in this model? How can you tell?
c. Interpret the 95% CI provided for e*.

119. Whale-watching is big business in Alaska, particularly around salmon release sites
where whales tend to congregate. The article “Humpback Whales Feed on
Hatchery-Released Juvenile Salmon” (Roy. Soc. Open Sci. 2017) reported a study to
determine what factors help predict the likelihood of spotting a humpback whale
when visiting one of these sites. The following data on \(x_1\) = days after final salmon
release, \(x_2\) = duration of visit (min), and whether a whale was sighted are from a
recent year at Little Port Walter.

x1   x2   Whale?     x1   x2   Whale?
2    15   Y          7    15   N
1    15   N          7    1    N
1    15   N          8    15   N
2    15   N          8    15   N
3    15   N          9    15   N
- 13 iN aa a
4    15   N          10   15   N
5    30   N          12   15   N
5    15   N          12   15   N
6    15   N          13   15   N
6    15   N          B    35   Y

(The full study investigated five sites across several years before, during, and after
salmon release.)
a. Use software to fit a multiple logistic regression model to this data, and
confirm that the estimated log-odds function is \(-5.68 - .096x_1 + .210x_2\).
b. What does the negative sign for the coefficient −.096 signify? What does
the positive sign for the coefficient .210 signify? [Hint: The latter should not be
surprising.]
c. Estimate the probability of spotting a humpback whale during a 30-minute
tour one week (i.e., seven days) after the final salmon release.
d. The estimated standard errors of the coefficients are \(s_{\hat{\beta}_1} = .253\) and
\(s_{\hat{\beta}_2} = .120\). Perform variable utility tests at the .1 significance level.
e. Interpret both \(e^{-.096}\) and \(e^{.210}\) in this context.

120. The article “Developing Coal Pillar Stability Chart Using Logistic Regression”
(J. Rock Mech. Mining Sci. 2013: 55-60) includes the following data on \(x_1\) = height–width
ratio, \(x_2\) = strength–stress ratio, and y = 1 (stable) or 0 (not stable) for 29 pillars
used to stabilize current and former mines in India.

x1 | 1.80  1.65  2.70  3.67  1.41  1.76  2.10  2.10
x2 | 2.40  2.54  0.84  1.68  2.41  1.93  1.77  1.50
y  | 1     1     1     1     1     1     1     1

x1 | 0.60  1.30  0.83  0.57  1.44  2.08  1.50  1.38
x2 | 1.27  0.87  0.97  0.94  1.00  0.78  1.03  0.82
y  | 0     0     0     0     0     0     0     0

x1 | 0.94  1.58  1.67  3.00  2.21
x2 | 1.30  0.83  1.05  1.19  0.86
y  | eM ? 0 ! .

a. Fit a multiple logistic regression model to this data, and report the estimated
logit equation.
b. Perform the two variable utility tests \(H_0\colon \beta_1 = 0\) and \(H_0\colon \beta_2 = 0\), each at the
.1 significance level.
c. Calculate the predicted probability of stability for pillar 8 (\(x_1\) = 2.10, \(x_2\) = 1.50).
d. Calculate the predicted probability of stability for pillar 28 (\(x_1\) = 3.00, \(x_2\) = 1.19).
Supplementary Exercises: (121-138)

121. In anticipation of future floods, insurance companies must quantify the relationship
between water depth and the amount of flood damage that will occur. The Federal
Insurance Administration provided the following information on x = depth of flooding
(in feet above first-floor level) and y = flood damage (as a percentage of structural value)
for homes with no basements.

Flood       Flood         Flood       Flood
level (x)   damage (y)    level (x)   damage (y)
0           a             8           44
1           10            9           45
2           14            10          46
3           26            11          47
4           28            12          48
5           29            13          49
6           41            14          50
7           43

a. Create a scatterplot of the data, and briefly describe what you see.
b. Does a straight-line relationship seem appropriate for this data? Why or why not?

122. The article “Exhaust Emissions from Four-Stroke Lawn Mower Engines” (J. Air
Water Manage. Assoc. 1997: 945-952) reported data from a study in which both a
baseline gasoline mixture and a reformulated gasoline were used. Consider the
following observations on age (year) and NOx emissions (g/kWh):

Engine          1      2      3      4      5
Age             0      0      2      11     7
Baseline        1.72   4.38   4.06   1.26   5.31
Reformulated    1.88   5.93   5.54   2.67   6.53

Engine          6      7      8      9      10
Age             16     9      0      12     4
Baseline        .57    3.37   3.44   .74    1.24
Reformulated    .74    4.94   4.89   .69    1.42

a. Construct a scatterplot of baseline vs. reformulated NOx emissions. Comment
on what you find.
b. Construct scatterplots of NOx emissions versus age. What appears to be the
nature of the relationship between these two variables?


123. The presence of hard alloy carbides in high chromium white iron alloys results in
excellent abrasion resistance, making them suitable for materials handling in the mining
and materials processing industries. The accompanying data on x = retained
austenite content (%) and y = abrasive wear loss (mm³) in pin wear tests with garnet as
the abrasive was read from a plot in the article “Microstructure-Property Relationships
in High Chromium White Iron Alloys” (Internat. Mater. Rev. 1996: 59-82). Refer to the
accompanying SAS output.

x | 4.6   17.0  17.4  18.0  18.5  22.4  26.5  30.0  34.0
y | .66   .92   1.45  1.03  .70   .73   1.20  .80   .91

x | 38.8  48.2  63.5  65.8  73.9  77.2  79.8  84.0
y | 1.19  1.15  1.12  1.37  1.45  1.50  1.36  1.29

a. What proportion of observed variation in wear loss can be attributed to the
simple linear regression model relationship?
b. What is the value of the sample correlation coefficient?
c. Test the utility of the simple linear regression model using α = .01.
d. Estimate the true average wear loss when content is 50% and do so in a way
that conveys information about reliability and precision.
e. What value of wear loss would you predict when content is 30%, and what
is the value of the corresponding residual?

Analysis of variance
Source     DF   Sum of squares   Mean square   F Value   Prob > F
Model      1    0.63690          0.63690       15.444    0.0013
Error      15   0.61860          0.04124
C Total    16   1.25551

Root MSE   0.20308     R-square   0.5073
Dep Mean   1.10765     Adj R-sq   0.4744
C.V.       18.33410

Parameter estimates
Variable   DF   Parameter estimate   Standard error   T for H0: Parameter = 0   Prob > |T|
INTERCEP   1    0.787218             0.09525879       8.264                     0.0001
AUSTCONT   1    0.007570             0.00192626       3.930                     0.0013

124. An investigation was carried out to study the relationship between speed (ft/s) and
stride rate (number of steps taken/s) among female marathon runners. Resulting summary
quantities included n = 11, \(\sum(\text{speed}) = 205.4\), \(\sum(\text{speed})^2 = 3880.08\), \(\sum(\text{rate}) = 35.16\),
\(\sum(\text{rate})^2 = 112.681\), and \(\sum(\text{speed})(\text{rate}) = 660.130\).

a. Calculate the equation of the least squares line that you would use to predict
stride rate from speed. [Hint: \(\bar{x} = \sum x_i/n\) and similarly for \(\bar{y}\);
\(S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n\) and similarly for \(S_{xx}\) and \(S_{yy}\).]
b. Calculate the equation of the least squares line that you would use to predict
speed from stride rate.
c. Calculate the coefficient of determination for the regression of stride rate on
speed of part (a) and for the regression of speed on stride rate of part (b). How
are these related?
d. How is the product of the two slope estimates related to the value calculated
in (c)?

125. Suppose that x and y are positive variables and that a sample of n pairs results in
r ≈ 1. If the sample correlation coefficient is computed for the (x, y²) pairs, will the
resulting value also be approximately 1? Explain.

126. In Section 12.4, we presented a formula for the variance \(V(\hat{\beta}_0 + \hat{\beta}_1 x^*)\) and a CI for
\(\beta_0 + \beta_1 x^*\). Taking x* = 0 gives \(\sigma^2_{\hat{\beta}_0}\) and a CI for \(\beta_0\). Use the data of Example 12.18 to
calculate the estimated standard deviation of \(\hat{\beta}_0\) and a 95% CI for the y-intercept of the
true regression line.

127. In biofiltration of wastewater, air discharged from a treatment facility is passed
through a damp porous membrane that causes contaminants to dissolve in water
and be transformed into harmless products. The accompanying data on x = inlet
temperature (°C) and y = removal efficiency (%) was the basis for a scatterplot that
appeared in the article “Treatment of Mixed Hydrogen Sulfide and Organic Vapors in a
Rock Medium Biofilter” (Water Environ. Res. 2001: 426-435).


Temp Removal % Obs Temp Removal % 
7.68 98.09 17 8.55 98.27 
6.51 98.25 18 7.57 98.00 
6.43 97.82 19 6.94 98.09 
5.48 97.82 20 8.32 98.25 
6.57 97.82 21 10.50 98.41 

10.22 97.93 22 16.02 98.51 

15.69 98.38 23 17.83 98.71 

16.77 98.89 24 17.03 98.79 

17.13 98.96 25 16.18 98.87 

17.63 98.90 26 16.26 98.76 

16.72 98.68 27 14.44 98.58 

15.45 98.69 28 12.78 98.73 

12.06 98.51 29 12.25 98.45 

11.44 98.09 30 11.69 98.37 

10.17 98.25 31 11.34 98.36 
9.64 98.36 32 10.97 98.45 


Calculated summary quantities are $\sum x_i = 384.26$, $\sum y_i = 3149.04$, $S_{xx} = 485.00$,
$S_{xy} = 36.71$, and $S_{yy} = 3.50$.

a. Does a scatterplot of the data suggest 
appropriateness of the simple linear 
regression model? 


b. Fit the simple linear regression model, obtain a point prediction of removal
efficiency when temperature = 10.50, and calculate the value of the corresponding
residual.

c. Roughly what is the size of a typical deviation of points in the scatterplot
from the least squares line?

d. What proportion of observed variation in removal efficiency can be attributed
to the model relationship?

e. Estimate the slope coefficient in a way 
that conveys information about relia- 
bility and precision, and interpret your 
estimate. 

f. Personal communication with the 
authors of the article revealed that one 
additional observation was not included 
in their scatterplot: (6.53, 96.55). What 
impact does this additional observation 
have on the equation of the least squares 
line and the values of s and $R^2$?


128. Normal hatchery processes in aquaculture inevitably produce stress in fish, which may
negatively impact growth, reproduction, flesh quality, and susceptibility to disease.
Such stress manifests itself in elevated and sustained corticosteroid levels. The article
“Evaluation of Simple Instruments for the Measurement of Blood Glucose and Lactate,
and Plasma Protein as Stress Indicators in Fish” (J. World Aquacult. Soc. 1999:
276-284) described an experiment in which fish were subjected to a stress protocol and
then removed and tested at various times after the protocol had been applied. The
accompanying data on x = time (min) and y = blood glucose level (mmol/L) was read
from a plot.

x    2    2    5    7    12   13   17   18   23   24   26   28
y    4.0  3.6  3.7  4.0  3.8  4.0  5.1  3.9  4.4  4.3  4.3  4.4

x    29   30   34   36   40   41   44   56   56   57   60   60
y    5.8  4.3  5.5  5.6  5.1  5.7  6.1  5.1  5.9  6.8  4.9  5.7

Use the methods developed in this chapter to analyze the data, and write a brief report
summarizing your conclusions (assume that the investigators are particularly interested
in glucose level 30 min after stress).


129. The article “Evaluating the BOD POD for Assessing Body Fat in Collegiate Football
Players” (Med. Sci. Sports Exerc. 1999: 1350-1356) reports on a new air displacement
device for measuring body fat. The customary procedure utilizes the hydrostatic
weighing device, which measures the percentage of body fat by means of water
displacement. Here is representative data read from a graph in the paper.

BOD   2.5   4.0   4.1   6.2   7.1   7.0   8.3   9.2   9.3   12.0  12.2
HW    8.0   6.2   9.2   6.4   8.6   12.2  7.2   12.0  14.9  12.1  15.3

BOD   12.6  14.2  14.4  15.1  15.2  16.3  17.1  17.9  17.9
HW    14.8  14.3  16.3  17.9  19.5  17.5  14.3  18.3  16.2

a. Use various methods to decide whether it is plausible that the two techniques
measure on average the same amount of fat.

b. Use the data to develop a way of predicting an HW measurement from a
BOD POD measurement, and investigate the effectiveness of such predictions.

130. Reconsider the situation of Exercise 123, in which x = retained austenite content using a
garnet abrasive and y = abrasive wear loss were related via the simple linear regression
model $Y = \beta_0 + \beta_1 x + \varepsilon$. Suppose that for a second type of abrasive, these variables are
also related via the simple linear regression model $Y = \gamma_0 + \gamma_1 x + \varepsilon$ and that
$V(\varepsilon) = \sigma^2$ for both types of abrasive. If the data set consists of $n_1$ observations on the
first abrasive and $n_2$ on the second and if $SSE_1$ and $SSE_2$ denote the two error sums of
squares, then a pooled estimate of $\sigma^2$ is $\hat\sigma^2 = (SSE_1 + SSE_2)/(n_1 + n_2 - 4)$. Let
$SS_{x1}$ and $SS_{x2}$ denote $\sum (x_i - \bar{x})^2$ for the data on the first and second abrasives,
respectively. A test of $H_0$: $\beta_1 - \gamma_1 = 0$ (equal slopes) is based on the statistic

$$T = \frac{\hat\beta_1 - \hat\gamma_1}{\hat\sigma \sqrt{\dfrac{1}{SS_{x1}} + \dfrac{1}{SS_{x2}}}}$$

When $H_0$ is true, T has a t distribution with $n_1 + n_2 - 4$ df. Suppose the 15 observations
using the alternative abrasive give $SS_{x2} = 7152.5578$, $\hat\gamma_1 = .006845$, and
$SSE_2 = .51350$. Using this along with the data of Exercise 123, carry out a test at level .05
to see whether expected change in wear loss associated with a 1% increase in austenite
content is identical for the two types of abrasive.


131. Show that the ANOVA version of the model utility test discussed in Section 12.3
(with test statistic F = MSR/MSE) is in fact a likelihood ratio test for $H_0$: $\beta_1 = 0$ versus
$H_a$: $\beta_1 \ne 0$. [Hint: We have already pointed out that the least squares estimates of $\beta_0$
and $\beta_1$ are the mle’s. What is the mle of $\beta_0$ when $H_0$ is true? Now determine the mle of
$\sigma^2$ both in $\Omega$ (when $\beta_1$ is not necessarily 0) and in $\Omega_0$ (when $H_0$ is true).]


132. Show that the t ratio version of the model utility test is equivalent to the ANOVA
F statistic version of the test. Equivalent here means that rejecting $H_0$: $\beta_1 = 0$ when
either $t \ge t_{\alpha/2,n-2}$ or $t \le -t_{\alpha/2,n-2}$ is the same as rejecting $H_0$ when
$f \ge F_{\alpha,1,n-2}$.


133. When a scatterplot of bivariate data shows a pattern resembling an exponentially
increasing or decreasing curve, the following multiplicative exponential model is
often used: $Y = \alpha e^{\beta x} \cdot \varepsilon$.

a. What does this multiplicative model imply about the relationship between
$Y' = \ln(Y)$ and x? [Hint: take logs on both sides of the model equation and let
$\beta_0 = \ln(\alpha)$, $\beta_1 = \beta$, $\varepsilon' = \ln(\varepsilon)$, and suppose that $\varepsilon$ has a lognormal
distribution.]

b. The accompanying data resulted from an investigation of how road pulse
duration (y, in ms, a measure of structural stress) varied with asphalt depth (x,
in mm) in a simulation of large trucks driving 40 mph (“Comparative Study of
Asphalt Pavement Responses Under FWD and Moving Vehicular Loading,”
J. Transp. Engr. 2016).

x    40   40   190  190  267  267  420  420
y    25   36   53   55   78   91   168  201

Fit the simple linear regression model to this data, and check model adequacy
using the residuals.

c. Is a scatterplot of the data consistent with the exponential regression model?
Fit this model by first carrying out a simple linear regression analysis using
$\ln(y)$ as the response variable and x as the explanatory variable. How good a fit
is the simple linear regression model to the “transformed” data (i.e., the
$(x, \ln(y))$ pairs)? What are point estimates of the parameters $\alpha$ and $\beta$?

d. Obtain a 95% prediction interval for pulse duration when asphalt thickness is
250 mm. [Hint: first obtain a PI for $\ln(y)$ based on the simple linear regression
carried out in (c).]


134. No tortilla chip aficionado likes soggy chips, so it is important to identify
characteristics of the production process that produce chips with an appealing texture.
The following data on x = frying time (s) and y = moisture content (%) appeared
in the article “Thermal and Physical Properties of Tortilla Chips as a Function of
Frying Time” (J. Food Process. Preserv. 1995: 175-189).

x    5     10    15    20    25    30    45    60
y    16.3  9.7   8.1   4.2   3.4   2.9   1.9   1.3

a. Construct a scatterplot of the data and comment.

b. Construct a scatterplot of the pairs $(\ln(x), \ln(y))$ (i.e., transform both x and
y by logs) and comment.

c. Consider the multiplicative power model $Y = \alpha x^{\beta} \cdot \varepsilon$. What does this model
imply about the relationship between $y' = \ln(y)$ and $x' = \ln(x)$ (assuming that $\varepsilon$
has a lognormal distribution)?

d. Obtain a prediction interval for moisture content when frying time is 25 s. [Hint:
first carry out a simple linear regression of $y'$ on $x'$ and calculate an appropriate
prediction interval.]


135. Forest growth and decline phenomena throughout the world have attracted
considerable public and scientific interest. The article “Relationships Among Crown
Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes”
(Canad. J. Forest Res. 1995: 386-397) included a scatterplot of y = mean crown
dieback (%), one indicator of growth retardation, and x = soil pH (higher pH
corresponds to less acidic soil), from which the following observations were taken:

x    3.3   3.4   3.4   3.5   3.6   3.6   3.7   3.7   3.8   3.8
y    7.3   10.8  13.1  10.4  5.8   9.3   12.4  14.9  11.2  8.0

x    3.9   4.0   4.1   4.2   4.3   4.4   4.5   5.0   5.1
y    6.6   10.0  9.2   12.4  2.3   4.3   3.0   1.6   1.0
a. Construct a scatterplot of the data. What 
model is suggested by the plot? 

b. Use a statistical software package to fit 
the model suggested in (a) and test its 
utility. 

c. Use the software package to obtain a 
prediction interval for crown dieback 
when soil pH is 4.0, and also a confi- 
dence interval for expected crown die- 
back in situations where the soil pH is 
4.0. How do these two intervals com- 
pare to each other? Is this result con- 
sistent with what you learned in simple 
linear regression? Explain. 

d. Use the software package to obtain a PI
and CI when x = 3.4. How do these
intervals compare to the corresponding
intervals obtained in (c)? Is this result
consistent with what you learned in
simple linear regression? Explain.


136. The article “Validation of the Rockport Fitness Walking Test in College Males and
Females” (Res. Q. Exerc. Sport 1994: 152-158) recommended the following estimated
regression equation for relating y = VO2max (L/min, a measure of cardiorespiratory
fitness) to the predictors $x_1$ = gender (female = 0, male = 1), $x_2$ = weight (lb),
$x_3$ = 1-mile walk time (min), and $x_4$ = heart rate at the end of the walk (beats/min):

$$y = 3.5959 + .6566x_1 + .0096x_2 - .0996x_3 - .0080x_4$$

a. How would you interpret the estimated coefficient −.0996?

b. How would you interpret the estimated coefficient .6566?

c. Suppose that an observation made on a male whose weight was 170 lb, walk
time was 11 min, and heart rate was 140 beats/min resulted in VO2max = 3.15.
What would you have predicted for VO2max in this situation, and what is
the value of the corresponding residual?

d. Using SSE = 30.1033 and SST = 102.3922, what proportion of observed
variation in VO2max can be attributed to the model relationship?

e. Assuming a sample size of n = 20, carry out a test of hypotheses to decide
whether the chosen model specifies a useful relationship between VO2max and at
least one of the predictors.


137. Investigators carried out a study to see how various characteristics of concrete are
influenced by $x_1$ = % limestone powder and $x_2$ = water-cement ratio, resulting in the
accompanying data (“Durability of Concrete with Addition of Limestone Powder,”
Mag. Concr. Res. 1996: 131-137).

x2     28-day comp str. (MPa)   Adsorbability (%)
.65    33.55                    8.42
.55    47.55                    6.26
.65    35.00                    6.74
.55    35.90                    6.59
.60    40.90                    7.28
.60    39.10                    6.90
.70    31.55                    10.80
.50    48.00                    5.63
.60    42.30                    7.43

a. Consider first compressive strength as the dependent variable y. Fit a first-order
model, and determine $R^2$.

b. Determine the adjusted $R^2$ value for a model including the interaction term and
also for the complete second-order model. Of the three models in parts (a)-(b), which
seems preferable?

c. Use the “best” model from part (b) to predict compressive strength when %
limestone = 14 and water-cement ratio = .60.

d. Repeat parts (a)-(b) with adsorbability as the response variable. That is, fit
three models: the first-order model, one with first-order terms plus an interaction,
and the complete second-order model.

138. A sample of n = 20 companies was selected, and the values of y = stock price and
k = 15 predictor variables (such as quarterly dividend, previous year’s earnings, and
debt ratio) were determined. When the multiple regression model using these 15
predictors was fit to the data, $R^2 = .90$ resulted.

a. Does the model appear to specify a useful relationship between y and the
predictor variables? Carry out a test using significance level .05. [Hint: The
F critical value for 15 numerator and 4 denominator df is 5.86.]

b. Based on the result of part (a), does a high $R^2$ value by itself imply that a
model is useful? Under what circumstances might you be suspicious of a
model with a high $R^2$ value?

c. With n and k as given previously, how large would $R^2$ have to be for the model
to be judged useful at the .05 level of significance?



Introduction 

In the simplest type of situation considered in this chapter, each observation in a sample is classified
as belonging to one of a finite number of categories; for example, blood type could be one of the
four categories O, A, B, or AB. With $p_i$ denoting the probability that any particular observation
belongs in category i, we wish to test a null hypothesis that completely specifies the values of all the
$p_i$’s (such as $H_0$: $p_1 = .45$, $p_2 = .35$, $p_3 = .15$, $p_4 = .05$). Other times, the null hypothesis specifies that
the $p_i$’s depend on some smaller number of parameters without specifying the values of these
parameters; the values of any unspecified parameters must then be estimated from the sample data. In
either case, the test statistic will be a measure of the discrepancy between the observed numbers in the
categories and the expected numbers when $H_0$ is true. This method, called a chi-squared test and
presented in Section 13.1, can also be applied to test the null hypothesis that the sample comes from a
particular probability distribution.

Chi-squared tests for two different situations are presented in Section 13.2. In the first, the null
hypothesis states that the $p_i$’s are the same for several different populations. The second type of
situation involves taking a sample from a single population and cross-classifying each individual with
respect to two different categorical factors (such as religious preference and political party
registration). The null hypothesis in this situation is that the two factors are independent within the
population.


13.1 Goodness-of-Fit Tests 


Recall that a binomial experiment consists of n independent trials in which each trial can result in one
of two possible outcomes, S (for success) and F (for failure). The probability of success is assumed to
be constant from trial to trial, and n is fixed at the outset of the experiment. A multinomial
experiment generalizes the binomial experiment by allowing each trial to result in one of k possible
outcomes, where k > 2. For example, suppose a store accepts three different types of credit cards:
Visa, MasterCard, and American Express. A multinomial experiment would result from observing the
type of credit card used (Visa, MC, or Amex) by each of 50 randomly selected customers who pay
with a credit card.

DEFINITION A multinomial experiment satisfies the following conditions:

1. The experiment consists of a sequence of n trials, where n is fixed in advance of the
experiment.

2. Each trial can result in one of the same k possible outcomes (also called categories).

3. The trials are independent, so that the outcome on any particular trial does not
influence the outcome on any other trial.

4. The probability that a trial results in category i is $p_i$, which remains constant from
trial to trial.

The parameters $p_1, \ldots, p_k$ must of course satisfy $p_i \ge 0$ and $\sum p_i = 1$.


If the experiment consists of selecting n individuals or objects from a population and categorizing
each one, then $p_i$ is interpreted as the proportion of the population falling in the ith category; such an
experiment will be approximately multinomial provided that n is much smaller than the population
size. In the aforementioned example, k = 3 (number of categories = number of credit cards accepted),
n = 50 (number of trials = number of customers), and $p_i$ denotes the proportion of all credit card
purchases made with type i (1 = Visa, 2 = MC, 3 = Amex).

The null hypothesis of interest at this point will specify the value of each $p_i$. For example, suppose
the store manager believes 50% of all credit card customers use Visa, 30% MasterCard, and the
remaining 20% American Express. This belief can be expressed as the assertion

$$H_0: p_1 = .5, \; p_2 = .3, \; p_3 = .2$$

The alternative hypothesis will state that $H_0$ is not true, i.e., that at least one of the $p_i$’s has a value
different from that asserted by $H_0$ (in which case at least two must be different, since they sum to 1).
The symbol $p_{i0}$ will represent the value of $p_i$ claimed by the null hypothesis. In the example just
given, $p_{10} = .5$, $p_{20} = .3$, and $p_{30} = .2$. (The symbol $p_{10}$ is read “p one naught” and not “p ten.”)

Before the multinomial experiment is performed, the number of trials that will result in the ith
category (i = 1, 2, ..., or k) is a random variable, just as the number of successes and the number of
failures in a binomial experiment are random variables. This random variable will be denoted by $N_i$
and its observed value by $n_i$. Since each trial results in exactly one of the k categories, $\sum N_i = n$, and
the same is true of the $n_i$’s. As an example, an experiment with n = 50 and k = 3 might yield $N_1 = 22$,
$N_2 = 13$, and $N_3 = 15$. The $N_i$’s (or $n_i$’s) are called the observed counts.

When the null hypothesis $H_0$: $p_1 = p_{10}, \ldots, p_k = p_{k0}$ is true, the expected number of trials
resulting in category i is

$$E(N_i) = (\text{total number of trials})(\text{hypothesized probability of category } i) = np_{i0}$$

These are the expected counts under $H_0$. For the case $H_0$: $p_1 = .5$, $p_2 = .3$, $p_3 = .2$ and n = 50,
$E(N_1) = 25$, $E(N_2) = 15$, and $E(N_3) = 10$ when $H_0$ is true. The expected counts, like the observed
counts, sum to n. It is customary to display both sets of counts, with the expected counts under $H_0$ in
parentheses below the observed counts. The counts in the credit card situation under discussion would
be displayed in tabular format as


Credit card        Visa    MC      Amex

Observed count     22      13      15
Expected count     (25)    (15)    (10)

A test procedure requires assessing the discrepancy between the observed and expected counts,
with $H_0$ being rejected when the discrepancy is sufficiently large. The test statistic, originally
proposed by Karl Pearson around 1900, is

$$\sum_{\text{all categories}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \sum_{i=1}^{k} \frac{(N_i - np_{i0})^2}{np_{i0}} \qquad (13.1)$$

The numerator of each term in the sum is the squared difference between observed and expected
counts. The more these differ within any particular category, the larger will be the contribution to the
overall sum. The reason for including the denominator term will be explained shortly. Since the
observed counts (the $N_i$’s) are random variables, their values depend on the specific sample collected,
and the test statistic (13.1) will vary in value from sample to sample. Larger values of the test statistic
indicate a greater discrepancy between the observed and expected counts, making us more apt to
reject $H_0$. The approximate sampling distribution of (13.1) is given in the following theorem.


PEARSON’S CHI-SQUARED THEOREM

When $H_0$: $p_1 = p_{10}, \ldots, p_k = p_{k0}$ is true, the statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_{i0})^2}{np_{i0}}$$

has approximately a chi-squared distribution with k − 1 df. This approximation
is reasonable provided that $np_{i0} \ge 5$ for every i (i = 1, 2, ..., k).


The chi-squared distribution was introduced in Chapter 6 and used in Chapter 8 to obtain a
confidence interval for the variance of a normal population. Recall that the chi-squared distribution has a
single parameter $\nu$, called the number of degrees of freedom (df) of the distribution. Analogous to the
critical value $t_{\alpha,\nu}$ for the t distribution, $\chi^2_{\alpha,\nu}$ is the value such that $\alpha$ of the area under the $\chi^2$ curve with
$\nu$ df lies to the right of $\chi^2_{\alpha,\nu}$ (see Figure 13.1). Selected values of $\chi^2_{\alpha,\nu}$ are given in Appendix Table A.5.
Notice that, unlike a z or t curve, the chi-squared distribution is positively skewed and only takes on
nonnegative values.


Figure 13.1 A critical value for a chi-squared distribution (the $\chi^2_\nu$ curve, with shaded area $\alpha$ to the right of $\chi^2_{\alpha,\nu}$)

The fact that df = k − 1 in the preceding theorem is a consequence of the restriction $\sum N_i = n$:
although there are k observed counts, once any k − 1 are known, the remaining one is uniquely
determined. That is, there are only k − 1 “freely determined” cell counts, and thus k − 1 df.


CHI-SQUARED GOODNESS-OF-FIT TEST

Null hypothesis: $H_0$: $p_1 = p_{10}, \ldots, p_k = p_{k0}$
Alternative hypothesis: $H_a$: at least one $p_i$ does not equal $p_{i0}$
Test statistic value: $\chi^2 = \sum_{i=1}^{k} \dfrac{(n_i - np_{i0})^2}{np_{i0}}$

Rejection region for level $\alpha$ test: $\chi^2 \ge \chi^2_{\alpha,k-1}$
P-value calculation: area under the $\chi^2_{k-1}$ curve to the right of the calculated $\chi^2$


The term “goodness-of-fit” refers to the idea that we wish to see how well the observed counts of a
categorical variable “fit” a set of hypothesized population proportions. Appendix Table A.5 provides
upper-tail critical values at five $\alpha$ levels for each different $\nu$. Because this is not sufficiently granular
for accurate P-value information, we have also included Appendix Table A.10, analogous to
Table A.7, that facilitates making more precise P-value statements.
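The boxed procedure can be sketched in a few lines of Python using the credit card counts from the illustration above. This is only a minimal sketch: the function name `pearson_chi2` is a hypothetical helper introduced here, and the P-value uses the fact that with exactly 2 df the upper-tail area under the chi-squared curve is exp(−x/2), so it does not generalize to other df.

```python
import math

def pearson_chi2(observed, null_probs):
    """Return (chi-squared statistic, expected counts) for a simple H0.

    Hypothetical helper for illustration; not taken from the text.
    """
    n = sum(observed)
    expected = [n * p for p in null_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

observed = [22, 13, 15]        # Visa, MC, Amex
null_probs = [0.5, 0.3, 0.2]   # H0: p1 = .5, p2 = .3, p3 = .2

stat, expected = pearson_chi2(observed, null_probs)
df = len(observed) - 1         # k - 1 = 2

# Closed form valid only for df = 2: P(chi-squared > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)

print("expected counts:", expected)
print("chi-squared:", round(stat, 3), "df:", df, "P-value:", round(p_value, 3))
```

Because the P-value here is well above the usual significance levels, these illustrative counts would not lead to rejecting the manager's hypothesized proportions.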


Example 13.1 If we focus on two different characteristics of an organism, each controlled by a 
single gene, and cross a pure strain having genotype AABB with a pure strain having genotype aabb 
(capital letters denoting dominant alleles and small letters recessive alleles), the resulting genotype 
will be AaBb. If these first-generation organisms are then crossed among themselves (a dihybrid 
cross), there will be four phenotypes depending on whether a dominant allele of either type is present. 
Mendel’s laws of inheritance imply that these four phenotypes should have probabilities 9/16, 3/16, 
3/16, and 1/16 of arising in any given dihybrid cross. 

The article “Inheritance of Fruit Attributes in Chili Pepper” (Indian J. Hort. 2019: 86-93) reports
the phenotype counts resulting from a dihybrid cross of two chili pepper varietals popular in India
(WBC-Sel-5 and GVC-101). There are k = 4 categories corresponding to the four possible
fruit-bearing phenotypes, with the null hypothesis being

$$H_0: p_1 = \frac{9}{16}, \; p_2 = \frac{3}{16}, \; p_3 = \frac{3}{16}, \; p_4 = \frac{1}{16}$$

Since the total sample size was n = 63, the expected cell counts are 63(9/16) = 35.44, 63(3/16)
= 11.81, 11.81, and 63(1/16) = 3.94. Observed and expected counts are given in Table 13.1.
(Although one expected count is slightly less than 5, Pearson’s chi-squared theorem should still apply
reasonably well to this scenario.)


Table 13.1 Observed and expected cell counts for Example 13.1 


i = 1              i = 2           i = 3               i = 4
Single, drooping   Single, erect   Cluster, drooping   Cluster, erect
32                 8               16                  7
(35.44)            (11.81)         (11.81)             (3.94)


The contribution to the $\chi^2$ test statistic value from the first cell is

$$\frac{(n_1 - np_{10})^2}{np_{10}} = \frac{(32 - 35.44)^2}{35.44} = .333$$

Cells 2, 3, and 4 contribute 1.230, 1.484, and 2.382, respectively, so $\chi^2$ = .333 + 1.230 +
1.484 + 2.382 = 5.43. The expected value of $\chi^2$ under $H_0$ is roughly $\nu$ = k − 1 = 3 and the standard
deviation is approximately $\sqrt{2\nu} \approx 2.5$. So our test statistic value is only about one standard deviation
larger than what we’d expect if the null hypothesis was true, seemingly not highly contradictory to
$H_0$.

More formally, a test with significance level .10 at 3 df requires $\chi^2_{.10,3}$, the number in the 3 df row
and .10 column of Appendix Table A.5. This critical value is 6.251. Since 5.43 < 6.251, $H_0$ cannot be
rejected even at this rather large level of significance. (The $\nu$ = 3 column of Appendix Table A.10
confirms that P-value > .10; software provides a P-value of .143.) The data is reasonably consistent
with Mendel’s laws. ■
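The arithmetic of Example 13.1 can be checked with a short Python sketch. The helper `chi2_sf` below is a hypothetical name introduced here; it approximates the chi-squared upper-tail area through the standard power series for the regularized lower incomplete gamma function, which is an implementation choice of this sketch rather than anything prescribed by the text. It is adequate for the moderate values seen in textbook examples.

```python
import math

def chi2_sf(x, df):
    """Approximate P(chi-squared with df degrees of freedom > x).

    Hypothetical helper: sums the series
    gamma(a, z) = z^a e^{-z} * sum_{n>=0} z^n / (a (a+1) ... (a+n)),
    with a = df/2 and z = x/2, then normalizes by Gamma(a).
    """
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    term = 1 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return 1 - lower

observed = [32, 8, 16, 7]                          # Table 13.1
null_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]      # Mendelian 9:3:3:1 ratio
n = sum(observed)                                  # 63

expected = [n * p for p in null_probs]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
stat = sum(contributions)
p_value = chi2_sf(stat, df=len(observed) - 1)

print("cell contributions:", [round(c, 3) for c in contributions])
print("chi-squared:", round(stat, 2), "P-value:", round(p_value, 3))
```

The same helper can be used to confirm a tabled critical value: `chi2_sf(6.251, 3)` should come out very close to .10.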


Why not simply use $\sum (N_i - np_{i0})^2$ as the test statistic, rather than the more complicated statistic
(13.1)? Suppose, for example, that $np_{10} = 100$ and $np_{20} = 10$. Then if $n_1 = 95$ and $n_2 = 5$, the two
categories contribute the same squared deviations to $\sum (n_i - np_{i0})^2$. Yet $n_1$ is only 5% less than what
would be expected when $H_0$ is true, whereas $n_2$ is 50% less. To take relative magnitudes of the
deviations into account, we divide each squared deviation by the corresponding expected count and
then combine.


$\chi^2$ for Completely Specified Probability Distributions

Frequently researchers wish to determine whether observed data is consistent with a particular 
probability distribution. When the distribution and all of its parameters are completely specified, 
Pearson’s chi-squared test can be applied to this scenario. Later in this section, we examine the case 
when the parameters must be estimated from the available data. 


Example 13.2 In a famous genetics article (“The Progeny in Generations $F_{12}$ to $F_{17}$ of a Cross
Between a Yellow-Wrinkled and a Green-Round Seeded Pea,” J. Genet. 1923: 255-331), the early
statistician G. U. Yule analyzed data resulting from crossing garden peas. The dominant alleles in the
experiment were Y = yellow color and R = round shape, resulting in the double dominant YR. Yule
examined 269 four-seed pods resulting from a dihybrid cross and counted the number of YR seeds in
each pod.

Let X denote the number of YR’s in a randomly selected peapod, so possible X values are 0, 1, 2,
3, 4. Based on the discussion in Example 13.1, the Mendelian laws are operative and genotypes of
individual seeds within a pod are independent of one another. Thus X has a Bin(4, 9/16) distribution. If
for i = 1, 2, 3, 4, 5 we define $p_i = P(X = i - 1)$, then we wish to test $H_0$: $p_1 = p_{10}, \ldots, p_5 = p_{50}$, where

$$p_{i0} = P(i - 1 \text{ YR's among 4 seeds when } H_0 \text{ is true}) = \binom{4}{i-1} \left(\frac{9}{16}\right)^{i-1} \left(1 - \frac{9}{16}\right)^{4-(i-1)}, \quad i = 1, 2, 3, 4, 5$$

Substituting into this binomial pmf gives hypothesized probabilities .0366, .1884, .3634, .3115, and
.1001. Yule’s data and the expected cell counts $np_{i0} = 269p_{i0}$ are in Table 13.2.

Table 13.2 Observed and expected cell counts for Example 13.2

i = 1     i = 2     i = 3     i = 4     i = 5
X = 0     X = 1     X = 2     X = 3     X = 4

16        45        100       82        26
(9.86)    (50.68)   (97.75)   (83.78)   (26.93)

The test statistic value is

$$\chi^2 = \frac{(16 - 9.86)^2}{9.86} + \cdots + \frac{(26 - 26.93)^2}{26.93} = 3.8234 + \cdots + .032 = 4.582$$

Since $4.582 < \chi^2_{.01,k-1} = \chi^2_{.01,4} = 13.277$, $H_0$ is not rejected at level .01. In fact, software provides a
P-value of .333, so $H_0$ should not be rejected at any reasonable significance level. ■
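As a rough check of Example 13.2, the hypothesized cell probabilities can be generated from the Bin(4, 9/16) pmf and fed into the chi-squared statistic. The sketch below is illustrative only: variable names are assumptions, and the P-value uses a closed form that is exact only for even df. Because the code carries unrounded expected counts, its statistic differs slightly in the second decimal place from the 4.582 obtained above with rounded counts.

```python
import math

n_seeds = 4
p_yr = 9 / 16                               # P(a seed is YR) under Mendel's laws
observed = [16, 45, 100, 82, 26]            # pods with X = 0, 1, 2, 3, 4 YR seeds
n = sum(observed)                           # 269 pods

# Binomial pmf values P(X = x) for x = 0, ..., 4.
null_probs = [
    math.comb(n_seeds, x) * p_yr**x * (1 - p_yr) ** (n_seeds - x)
    for x in range(n_seeds + 1)
]
expected = [n * p for p in null_probs]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                      # k - 1 = 4
# For even df, P(chi-squared > x) = exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
half = stat / 2
p_value = math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))

print("hypothesized probabilities:", [round(p, 4) for p in null_probs])
print("expected counts:", [round(e, 2) for e in expected])
print("chi-squared:", round(stat, 3), "P-value:", round(p_value, 3))
```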


The $\chi^2$ test can also be used to test whether a sample comes from a specific underlying continuous
distribution. Let X denote the variable being sampled and suppose the hypothesized pdf of X is $f_0(x)$.
As in the construction of a frequency distribution in Chapter 1, subdivide the measurement scale of
X into k disjoint intervals $(-\infty, a_1), [a_1, a_2), \ldots, [a_{k-1}, \infty)$. The cell probabilities for i = 2, ..., k − 1
specified by $H_0$ are then

$$p_{i0} = P(a_{i-1} \le X < a_i) = \int_{a_{i-1}}^{a_i} f_0(x)\,dx$$

and similarly for the two extreme intervals. The intervals should be chosen so that $np_{i0} \ge 5$ for i = 1,
..., k; often they are selected so that the $p_{i0}$’s are equal. Once the $p_{i0}$’s are calculated, the underlying
distribution is, in a sense, irrelevant: the chi-squared test will determine whether data is consistent
with any probability distribution that places probability $p_{i0}$ on the ith specified interval.


Example 13.3 To see whether time of birth is uniformly distributed throughout a 24-hour day, we can
divide a day into one-hour periods starting at midnight (k = 24 intervals). The null hypothesis states
that f(x) is the uniform pdf on the interval [0, 24], so that $p_{i0} = 1/24$ for all i. A random sample of
1000 births from the CDC’s 2018 Natality Public Use File resulted in cell counts of 34 (midnight to
12:59 a.m.), 28, 37, 29, 31, 28, 32, 38, 73, 50, 43, 52, 58, 58, 46, 43, 51, 35, 46, 32, 53, 31, 35, and 37
(11:00 p.m. to 11:59 p.m.). Each expected “cell count” is 1000 · 1/24 = 41.67, and the resulting value
of $\chi^2$ is 74.432. Since $\chi^2_{.01,23} = 41.637$, the computed value is highly significant, and the null
hypothesis is resoundingly rejected. In particular, babies were far more likely to be born in the
8:00 a.m. to 8:59 a.m. window (observed count = 73) than in any other hour of the day. ■
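The statistic in Example 13.3 is easy to reproduce. The following Python sketch, with variable names chosen here purely for illustration, computes the common expected count, the chi-squared value, and the hour with the largest observed count; the critical value 41.637 is copied from the example rather than computed.

```python
# Hourly birth counts from Example 13.3, starting with the midnight hour.
counts = [34, 28, 37, 29, 31, 28, 32, 38, 73, 50, 43, 52,
          58, 58, 46, 43, 51, 35, 46, 32, 53, 31, 35, 37]

n = sum(counts)               # total number of births sampled
k = len(counts)               # 24 one-hour cells
expected = n / k              # same expected count in every cell under H0

stat = sum((c - expected) ** 2 / expected for c in counts)

critical_value = 41.637       # chi-squared critical value at level .01 with 23 df (from the text)
busiest_hour = max(range(k), key=lambda h: counts[h])   # 0 = midnight hour

print("n:", n, "expected per cell:", round(expected, 2))
print("chi-squared:", round(stat, 3), "reject H0:", stat >= critical_value)
print("hour with most births starts at:", busiest_hour, ":00")
```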


Example 13.4 The developers of a new online tax program want it to satisfy three criteria: (1) actual 
time to complete a tax return is normally distributed; (2) the mean completion time is 90 min; (3) 90% 
of all users will finish their tax returns within 120 min (2 h). A pilot test of the program will utilize 
120 volunteers, resulting in 120 completion time observations. This data will be used to test whether 
the performance criteria are met, using a chi-squared test with k = 8 intervals. 

Calculating normal probabilities requires both $\mu$ and $\sigma$. The target value $\mu$ = 90 min is given; the
90th percentile of a normal distribution is $\mu + 1.28\sigma$, and the criterion $\mu + 1.28\sigma = 120$ min implies
$\sigma$ = 23.44 min. To divide the standard normal scale into eight equally likely intervals, we look for
the .125 quantile, .25 quantile, etc., in the z table. From Table A.3 these values, which form the
boundary points of our intervals, have z-scores equal to

−1.15    −.675    −.32    0    .32    .675    1.15

For $\mu$ = 90 and $\sigma$ = 23.44, these boundary points become

63.04    74.18    82.50    90.00    97.50    105.82    116.96

(Completion times obviously cannot be negative. The area to the left of x = 0 under this curve is
negligible, so this issue is not of concern here.) If we define $p_i$ = the probability a randomly selected
completion time falls in the ith interval defined by the above boundary points, then the goal is to test
$H_0$: $p_1 = .125, \ldots, p_8 = .125$.

Suppose the observed counts are as shown in the accompanying table; the expected count for each
interval is $np_{i0}$ = (120)(.125) = 15.


Lower endpoint of interval 0 63.04 74.18 82.50 90.00 97.50 105.82 116.96 


Observed count 21 17 12 16 10 15 19 10 
Expected count (15) (15) (15) (15) (15) (15) (15) (15) 


The resulting test statistic value is

$$\chi^2 = \frac{(21 - 15)^2}{15} + \cdots + \frac{(10 - 15)^2}{15} = 7.73$$

The corresponding P-value (using Table A.10 at df = 8 − 1 = 7) exceeds .100; statistical software
gives P-value = .357. Thus we have no reason to reject $H_0$; the 120 observations are consistent with a
N(90, 23.44) population distribution, as desired. ■
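The interval construction in Example 13.4 can be sketched with Python's standard-library `statistics.NormalDist`. This is an illustration under stated assumptions: the code uses exact normal quantiles, so its boundary points and its value of σ differ in the second decimal place from those obtained above with the rounded table z-scores (for instance, roughly 23.41 rather than 23.44 for σ). The variable names are hypothetical.

```python
from statistics import NormalDist

mu = 90.0
# Solve mu + z_.90 * sigma = 120 with the exact 90th percentile z-score;
# the text rounds z to 1.28 and reports sigma = 23.44.
z90 = NormalDist().inv_cdf(0.90)
sigma_exact = (120 - mu) / z90

sigma = 23.44                 # value used in the example
k = 8                         # number of equally likely intervals
target = NormalDist(mu, sigma)

# Interior boundary points: the 1/8, 2/8, ..., 7/8 quantiles of N(90, 23.44).
boundaries = [target.inv_cdf(i / k) for i in range(1, k)]

observed = [21, 17, 12, 16, 10, 15, 19, 10]
expected = sum(observed) / k  # 120 * .125 = 15 per interval
stat = sum((o - expected) ** 2 / expected for o in observed)

print("sigma from exact quantile:", round(sigma_exact, 2))
print("boundaries:", [round(b, 2) for b in boundaries])
print("chi-squared:", round(stat, 2), "df:", k - 1)
```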


Goodness-of-Fit Tests for Composite Hypotheses 

The goodness-of-fit test based on Pearson’s chi-squared theorem involves a simple null hypothesis, in
the sense that each $p_{i0}$ is a specified number, so that the expected cell counts when $H_0$ is true are
completely determined. But in some situations, $H_0$ states only that the $p_i$’s are functions of other
parameters $\theta_1, \ldots, \theta_m$, without specifying the values of these $\theta_i$’s.

For example, a population may be in equilibrium with respect to proportions of the three genotypes
AA, Aa, and aa. With $p_1$, $p_2$, and $p_3$ denoting these proportions (probabilities), one may wish to test
$H_0$: $p_1 = \theta^2$, $p_2 = 2\theta(1 - \theta)$, $p_3 = (1 - \theta)^2$, where $\theta$ represents the proportion of gene A in the
population. This hypothesis is composite because knowing that $H_0$ is true does not uniquely
determine the cell probabilities and expected cell counts, but only their general form. More generally, the
null hypothesis now states that each $p_i$ is a function of a small number of parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_m)$
with the $\theta_i$’s otherwise unspecified:

$$H_0: p_1 = \pi_1(\boldsymbol\theta), \ldots, p_k = \pi_k(\boldsymbol\theta)$$
$$H_a: \text{the hypothesis } H_0 \text{ is not true} \qquad (13.2)$$

In the genotype example, m = 1 (there is only one $\theta$), $\pi_1(\theta) = \theta^2$, $\pi_2(\theta) = 2\theta(1 - \theta)$, and $\pi_3(\theta) = (1 - \theta)^2$.
To carry out a $\chi^2$ test, the unknown $\theta_i$’s must first be estimated.


In the case k = 2, there is really only a single rv, $N_1$ (since $N_1 + N_2 = n$), which has a binomial
distribution. The joint probability that $N_1 = n_1$ and $N_2 = n_2$ is then

\[
P(N_1 = n_1, N_2 = n_2) = \binom{n}{n_1} p_1^{n_1} p_2^{n_2} \propto p_1^{n_1} p_2^{n_2}
\]

where $p_1 + p_2 = 1$ and $n_1 + n_2 = n$. For general k, the joint distribution of $N_1, \ldots, N_k$ is the
multinomial distribution (Section 5.1) with

\[
P(N_1 = n_1, \ldots, N_k = n_k) \propto p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}
\]

which, when $H_0$ is true, becomes

\[
P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\boldsymbol{\theta})]^{n_1} \cdots [\pi_k(\boldsymbol{\theta})]^{n_k} \tag{13.3}
\]

METHOD OF MULTINOMIAL ESTIMATION
Let $n_1, n_2, \ldots, n_k$ denote the observed values of $N_1, \ldots, N_k$. Then
$\hat\theta_1, \ldots, \hat\theta_m$ are those values of the $\theta_j$’s that maximize Expression (13.3),
that is, the maximum likelihood estimates with respect to the multinomial
model.


Example 13.5 In humans there is a blood group, the MN group, that is composed of individuals 
having one of three blood types: M, MN, or N. Type is determined by two alleles, and there is no 
dominance, so the three possible genotypes give rise to three phenotypes. A population consisting of 
individuals in the MN group is in equilibrium if 


\[
P(\mathrm{M}) = p_1 = \theta^2 \qquad P(\mathrm{MN}) = p_2 = 2\theta(1-\theta) \qquad P(\mathrm{N}) = p_3 = (1-\theta)^2
\]
for some $\theta$. Suppose a sample from such a population yielded the results shown in Table 13.3.


Table 13.3 Observed counts for Example 13.5 


Type              M      MN     N
Observed count    125    225    150    n = 500


Then Expression (13.3) becomes 


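\[
[\pi_1(\theta)]^{n_1} [\pi_2(\theta)]^{n_2} [\pi_3(\theta)]^{n_3}
= [\theta^2]^{n_1} [2\theta(1-\theta)]^{n_2} [(1-\theta)^2]^{n_3}
= 2^{n_2} \cdot \theta^{2n_1+n_2} \cdot (1-\theta)^{n_2+2n_3}
\]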


Maximizing this with respect to $\theta$ (or, equivalently, maximizing the natural logarithm of this quantity,
which is easier to differentiate) yields

\[
\hat\theta = \frac{2n_1+n_2}{(2n_1+n_2) + (n_2+2n_3)} = \frac{2n_1+n_2}{2n}
\]

With $n_1 = 125$ and $n_2 = 225$, $\hat\theta = 475/1000 = .475$.
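
For readers who want the intermediate calculus, the maximization in Example 13.5 can be written out as follows. This is our own filling-in of the steps, using only the multinomial expression for the MN blood group model.

\[
\ln L(\theta) = n_2 \ln 2 + (2n_1+n_2)\ln\theta + (n_2+2n_3)\ln(1-\theta)
\]

\[
\frac{d}{d\theta}\ln L(\theta) = \frac{2n_1+n_2}{\theta} - \frac{n_2+2n_3}{1-\theta} = 0
\quad\Longrightarrow\quad
(2n_1+n_2)(1-\theta) = (n_2+2n_3)\,\theta
\]

\[
\hat\theta = \frac{2n_1+n_2}{(2n_1+n_2)+(n_2+2n_3)} = \frac{2n_1+n_2}{2(n_1+n_2+n_3)} = \frac{2n_1+n_2}{2n}
\]

The second derivative of $\ln L$ is negative for every $0 < \theta < 1$, so this critical point is indeed a maximum.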


Once $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)$ has been estimated by $\hat{\boldsymbol{\theta}} = (\hat\theta_1, \ldots, \hat\theta_m)$, the estimated expected cell count
for the ith category is $n\hat p_i = n\pi_i(\hat{\boldsymbol{\theta}})$. These are now used in place of the $np_{i0}$’s in Expression (13.1) to
specify a $\chi^2$ statistic. The following theorem was proved by R. A. Fisher in 1924 as a generalization
of Pearson’s chi-squared test.


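FISHER’S CHI-SQUARED THEOREM
Under general “regularity” conditions on $\theta_1, \ldots, \theta_m$ and the $\pi_i(\boldsymbol{\theta})$’s, if $\theta_1, \ldots, \theta_m$
are estimated by maximizing the multinomial expression (13.3), the rv

\[
\chi^2 = \sum_{i=1}^{k} \frac{(N_i - n\hat P_i)^2}{n\hat P_i} = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat{\boldsymbol{\theta}})]^2}{n\pi_i(\hat{\boldsymbol{\theta}})}
\]

has approximately a chi-squared distribution with k - 1 - m df when $H_0$ of
(13.2) is true. An approximately level $\alpha$ test of $H_0$ versus $H_a$ is then to reject $H_0$
if $\chi^2 \ge \chi^2_{\alpha,k-1-m}$.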


In practice, the test can be used if $n\pi_i(\hat{\boldsymbol{\theta}}) \ge 5$ for every i.


Notice that the number of degrees of freedom is reduced by the number of $\theta_j$’s estimated.


Example 13.6 (Example 13.5 continued) With $\hat\theta = .475$ and n = 500, the estimated expected cell
counts are

\[
\begin{aligned}
n\hat p_1 &= n\pi_1(\hat\theta) = n \cdot \hat\theta^2 = 500 \cdot (.475)^2 = 112.81 \\
n\hat p_2 &= n\pi_2(\hat\theta) = n \cdot 2\hat\theta(1-\hat\theta) = 500 \cdot 2(.475)(.525) = 249.38 \\
n\hat p_3 &= n\pi_3(\hat\theta) = n \cdot (1-\hat\theta)^2 = 500 \cdot (.525)^2 = 137.81
\end{aligned}
\]

Notice that these estimated expected counts sum to n = 500. Then

\[
\chi^2 = \frac{(125 - 112.81)^2}{112.81} + \frac{(225 - 249.38)^2}{249.38} + \frac{(150 - 137.81)^2}{137.81} = 4.78
\]

Since $\chi^2_{.05,k-1-m} = \chi^2_{.05,3-1-1} = \chi^2_{.05,1} = 3.843$ and 4.78 > 3.843, $H_0$ is rejected at the .05
significance level (software provides P-value = .029). Therefore, the data does not conform to the proposed
equilibrium model. Even using the $\theta$ estimate that “fits” the data best, the expected counts under the
null model are too discordant with the observed counts.
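
The calculations of Examples 13.5 and 13.6 can be reproduced in a few lines. The sketch below is illustrative only; the variable names are ours, and the P-value uses the fact that for 1 df the chi-squared upper-tail area at x equals erfc(sqrt(x/2)).

```python
import math

counts = [125, 225, 150]                 # observed M, MN, N counts (Table 13.3)
n = sum(counts)

# Closed-form multinomial MLE derived in Example 13.5
theta_hat = (2 * counts[0] + counts[1]) / (2 * n)

# Estimated cell probabilities under the equilibrium model
probs = [theta_hat**2, 2 * theta_hat * (1 - theta_hat), (1 - theta_hat)**2]
expected = [n * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
df = len(counts) - 1 - 1                 # k - 1 - m with m = 1 estimated parameter
p_value = math.erfc(math.sqrt(chi2 / 2)) # valid because df = 1 here
print(theta_hat, [round(e, 2) for e in expected], round(chi2, 2), round(p_value, 3))
```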


Example 13.7 Consider a series of games between two teams, I and II, that terminates as soon as 
one team has won four games (with no possibility of a tie)—this is the “best of seven” format used for 
many professional league play-offs. A simple probability model for such a series assumes that
outcomes of successive games are independent and that the probability of team I winning any
particular game is a constant $\theta$. We arbitrarily designate team I the better team, so that $\theta \ge .5$. Any
particular series can terminate after 4, 5, 6, or 7 games. Let $\pi_1(\theta)$, $\pi_2(\theta)$, $\pi_3(\theta)$, $\pi_4(\theta)$ denote the
probability of termination in 4, 5, 6, and 7 games, respectively. Then

\[
\begin{aligned}
\pi_1(\theta) &= P(\text{I wins in 4 games}) + P(\text{II wins in 4 games}) \\
&= \theta^4 + (1-\theta)^4 \\
\pi_2(\theta) &= P(\text{I wins 3 of the first 4 and the fifth}) + P(\text{I loses 3 of the first 4 and the fifth}) \\
&= \binom{4}{3}\theta^3(1-\theta)\cdot\theta + \binom{4}{1}\theta(1-\theta)^3\cdot(1-\theta) \\
&= 4\theta(1-\theta)\left[\theta^3 + (1-\theta)^3\right] \\
\pi_3(\theta) &= 10\theta^2(1-\theta)^2\left[\theta^2 + (1-\theta)^2\right] \\
\pi_4(\theta) &= 20\theta^3(1-\theta)^3
\end{aligned}
\]

The Mathematics Magazine article “Seven-Game Series in Sports” by Groeneveld and Meeden tested 


the fit of this model to results of National Hockey League playoffs during the period 1943-1967, 
when league membership was stable. The data appears in Table 13.4. 


Table 13.4 Observed and expected counts for the simple model 


i = 1       i = 2       i = 3       i = 4
4 games     5 games     6 games     7 games
15          26          24          18          n = 83
(16.351)    (24.153)    (23.240)    (19.256)


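The estimated expected cell counts are $83\pi_i(\hat\theta)$, where $\hat\theta$ is the value of $\theta$ that maximizes the
multinomial expression

\[
\left\{\theta^4 + (1-\theta)^4\right\}^{15} \cdot \left\{4\theta(1-\theta)\left[\theta^3 + (1-\theta)^3\right]\right\}^{26} \cdot \left\{10\theta^2(1-\theta)^2\left[\theta^2 + (1-\theta)^2\right]\right\}^{24} \cdot \left\{20\theta^3(1-\theta)^3\right\}^{18}
\tag{13.4}
\]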


Standard calculus methods fail to yield a nice formula for the maximizing value $\hat\theta$, so it must be
computed using numerical methods. The result is $\hat\theta = .654$, from which $\pi_i(\hat\theta)$ and the estimated
expected cell counts in Table 13.4 were computed. The resulting test statistic value is $\chi^2 = .360$,
much lower than the critical value $\chi^2_{.10,k-1-m} = \chi^2_{.10,4-1-1} = \chi^2_{.10,2} = 4.605$. There is thus no reason to
reject the simple model as applied to NHL playoff series, at least for that early era.
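
One simple way to carry out the numerical maximization of (13.4) is a grid search on the log scale. The sketch below is not the method used in the cited article; the grid step of 0.0001 and all function names are our own assumptions. Because the likelihood is symmetric about .5, the search is restricted to .5 ≤ θ < 1.

```python
import math

counts = [15, 26, 24, 18]          # series lasting 4, 5, 6, 7 games (Table 13.4)
n = sum(counts)

def cell_probs(theta):
    """pi_1(theta), ..., pi_4(theta) for the simple best-of-seven model."""
    q = 1.0 - theta
    return [
        theta**4 + q**4,
        4 * theta * q * (theta**3 + q**3),
        10 * theta**2 * q**2 * (theta**2 + q**2),
        20 * theta**3 * q**3,
    ]

def log_lik(theta):
    """Natural log of the multinomial expression (13.4)."""
    return sum(c * math.log(p) for c, p in zip(counts, cell_probs(theta)))

# Crude grid search over [0.5, 1); the step size is an arbitrary choice.
grid = [0.5 + i / 10000 for i in range(5000)]
theta_hat = max(grid, key=log_lik)

expected = [n * p for p in cell_probs(theta_hat)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(round(theta_hat, 3), [round(e, 2) for e in expected], round(chi2, 3))
```

The output should land close to the values reported in the example; small discrepancies in the last decimal place of the expected counts and of the statistic are to be expected from rounding.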

The cited article also considered World Series data for the period 1903-1973. For the preceding
model, $\chi^2 = 5.97 > 4.605$, so the model does not seem appropriate. The suggested reason for this is
that for this simple model it can be shown that

\[
P(\text{series lasts exactly six games} \mid \text{series lasts at least six games}) \ge .5, \tag{13.5}
\]

whereas of the 38 best-of-seven series that actually lasted at least six games, only 13 lasted exactly
six. The following alternative model is then introduced:

\[
\begin{aligned}
\pi_1(\theta_1, \theta_2) &= \theta_1^4 + (1-\theta_1)^4 & \pi_3(\theta_1, \theta_2) &= 10\theta_1^2(1-\theta_1)^2\,\theta_2 \\
\pi_2(\theta_1, \theta_2) &= 4\theta_1(1-\theta_1)\left[\theta_1^3 + (1-\theta_1)^3\right] & \pi_4(\theta_1, \theta_2) &= 10\theta_1^2(1-\theta_1)^2\,(1-\theta_2)
\end{aligned}
\]

The first two $\pi_i$’s are identical to the simple model, while $\theta_2$ is the conditional probability of (13.5),
which can now be any number between zero and one. The values of $\theta_1$ and $\theta_2$ that maximize the
multinomial expression analogous to (13.4) are determined numerically as $\hat\theta_1 = .614$ and $\hat\theta_2 = .342$.
A summary appears in Table 13.5, and $\chi^2 = .384$. Two parameters are estimated, so df = k - 1 - m =
4 - 1 - 2 = 1 and $.384 < \chi^2_{.10,1} = 2.706$, indicating a good fit of the data to this new model.


Table 13.5 Observed and expected counts for the more complex model 


4 games 5 games 6 games 7 games 
12 16 13 25 
(10.85) (18.08) (12.68) (24.39) 
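
The two-parameter fit can be sketched in the same way. One observation of ours (not stated in the text) simplifies the work: θ2 enters the likelihood only through the factor θ2^13 (1 - θ2)^25, so its maximizing value is the sample proportion 13/38, and only θ1 requires a numerical search. The grid step and all names below are illustrative assumptions.

```python
import math

counts = [12, 16, 13, 25]          # World Series lengths 4, 5, 6, 7 (Table 13.5)
n = sum(counts)

def cell_probs(t1, t2):
    """Cell probabilities for the two-parameter model."""
    q = 1.0 - t1
    at_least_six = 10 * t1**2 * q**2          # P(series lasts at least six games)
    return [
        t1**4 + q**4,
        4 * t1 * q * (t1**3 + q**3),
        at_least_six * t2,
        at_least_six * (1 - t2),
    ]

# theta_2 separates from theta_1 in the likelihood, giving a closed form
theta2_hat = counts[2] / (counts[2] + counts[3])

def log_lik_theta1(t1):
    return sum(c * math.log(p) for c, p in zip(counts, cell_probs(t1, theta2_hat)))

grid = [0.5 + i / 10000 for i in range(5000)]  # search .5 <= theta_1 < 1
theta1_hat = max(grid, key=log_lik_theta1)

expected = [n * p for p in cell_probs(theta1_hat, theta2_hat)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(round(theta1_hat, 3), round(theta2_hat, 3), round(chi2, 3))
```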


One of the regularity conditions on the $\theta_j$’s in Fisher’s theorem is that they be functionally
independent of one another. That is, no single $\theta_j$ can be determined from the values of other $\theta_j$’s, so
that m is the number of functionally independent parameters estimated. A general rule for degrees of
freedom in a chi-squared test is the following.


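GENERAL $\chi^2$ DF RULE

\[
\chi^2\ \text{df} = \begin{pmatrix} \text{number of freely} \\ \text{determined cell counts} \end{pmatrix} - \begin{pmatrix} \text{number of independent} \\ \text{parameters estimated} \end{pmatrix}
\]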


This rule will be used in connection with several different chi-squared tests in the next section. 


$\chi^2$ for Probability Distributions with Parameter Values Unspecified

In Examples 13.2-13.4, we considered goodness-of-fit tests to assess whether quantitative data was 
consistent with a particular distribution, such as Bin(4, 2) or N(90, 23.44). Pearson’s chi-squared 
theorem could be applied because all model parameters were completely specified. But quite often 
researchers wish to determine whether their data conforms to any member of a particular family—any 
Poisson distribution, any Weibull distribution, etc. To use the $\chi^2$ test to see whether the distribution is
Poisson, for example, the parameter $\mu$ must be estimated. In addition, because there are actually an
infinite number of possible values of a Poisson variable, these values must be grouped so that there
are a finite number of cells.


Example 13.8 Table 13.6 presents count data on X = the number of egg pouches produced by B. 
alexandrina snails that were subjected to both parasitic infection and drought stress (meant to simulate 
the effects of climate change), as reported in the article “One Stimulus, Two Responses: Host and 
Parasite Life-History Variation in Response to Environmental Stress” (Evolution 2016: 2640-2646). 


Table 13.6 Observed counts for Example 13.8 


Cell                  i = 1    i = 2    i = 3    i = 4    i = 5
No. of egg pouches    0        1        2        3        ≥ 4
Observed count        44       2        5        1        9

Denoting the sample values by $x_1, \ldots, x_{61}$, 44 of the $x_i$’s were 0, two were 1, and so on. The nine
observed counts in the last cell were 4, 5, 6, 6, 7, 11, 13, 15, and 17, but these have been collapsed
into a single “≥ 4” category in order to ensure that all expected counts will be at least 5.

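The authors considered fitting a Poisson distribution to the data; let $\mu$ denote the Poisson
parameter. The estimate of $\mu$ required for Fisher’s $\chi^2$ procedure is obtained by maximizing the
multinomial expression (13.3). The cell probabilities are

\[
\pi_i(\mu) = P(X = i-1) = \frac{e^{-\mu}\mu^{i-1}}{(i-1)!} \quad i = 1, 2, 3, 4 \qquad\qquad \pi_5(\mu) = P(X \ge 4) = 1 - \sum_{x=0}^{3} \frac{e^{-\mu}\mu^{x}}{x!}
\]

so the right-hand side of (13.3) becomes

\[
\left[\frac{e^{-\mu}\mu^{0}}{0!}\right]^{44} \left[\frac{e^{-\mu}\mu^{1}}{1!}\right]^{2} \left[\frac{e^{-\mu}\mu^{2}}{2!}\right]^{5} \left[\frac{e^{-\mu}\mu^{3}}{3!}\right]^{1} \left[1 - \sum_{x=0}^{3} \frac{e^{-\mu}\mu^{x}}{x!}\right]^{9}
\tag{13.6}
\]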


There is no nice formula for the maximizing value of $\mu$ in Expression (13.6), so it must be obtained
numerically.
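
As an illustration of what “obtained numerically” can mean here, the following sketch maximizes the log of the grouped (multinomial) Poisson likelihood for the five cells of Table 13.6 by a plain grid search. The search interval [0.05, 5], the step size, and the function names are our own assumptions; any one-dimensional optimizer would serve equally well.

```python
import math

counts = [44, 2, 5, 1, 9]     # cells x = 0, 1, 2, 3, and x >= 4 (Table 13.6)

def cell_probs(mu):
    """Poisson cell probabilities: P(X = 0), ..., P(X = 3), P(X >= 4)."""
    first = [math.exp(-mu) * mu**x / math.factorial(x) for x in range(4)]
    return first + [1.0 - sum(first)]

def log_lik(mu):
    """Natural log of the grouped-data (multinomial) likelihood."""
    return sum(c * math.log(p) for c, p in zip(counts, cell_probs(mu)))

# Grid search over an assumed plausible range for mu
grid = [i / 1000 for i in range(50, 5001)]
mu_hat_grouped = max(grid, key=log_lik)
print(mu_hat_grouped)
```

Because the grouped likelihood treats the nine largest observations only as “at least 4,” the resulting estimate need not agree with the sample mean of the ungrouped data.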


While maximizing Expression (13.6) with respect to $\mu$ is challenging, there is an alternative way to
estimate $\mu$: apply the method of maximum likelihood from Chapter 7 to the full sample $X_1, \ldots, X_n$.
Because parameter estimates are usually much more difficult to compute from the multinomial
likelihood function (13.3) than from the full-sample likelihood, they are often computed using this
latter method. Using Fisher’s critical value $\chi^2_{\alpha,k-1-m}$ then results in an approximate level $\alpha$ test.


Example 13.9 (Example 13.8 continued) The likelihood of the observed sample $x_1, \ldots, x_{61}$ under a
Poisson($\mu$) model is

\[
L(\mu) = p(x_1; \mu) \cdots p(x_{61}; \mu) = \frac{e^{-\mu}\mu^{x_1}}{x_1!} \cdots \frac{e^{-\mu}\mu^{x_{61}}}{x_{61}!} = \frac{e^{-61\mu}\mu^{\sum x_i}}{x_1! \cdots x_{61}!} = \frac{e^{-61\mu}\mu^{99}}{x_1! \cdots x_{61}!}
\]

The value of $\mu$ for which this is maximized (i.e., the maximum likelihood estimate of $\mu$) is
$\hat\mu = \sum x_i / n = 99/61 = 1.623$. Using $\hat\mu = 1.623$, the estimated expected cell counts are computed
from $n\pi_i(\hat\mu)$, where n = 61. For example,

\[
n\pi_1(\hat\mu) = 61 \cdot \frac{e^{-1.623}(1.623)^0}{0!} = (61)(.1973) = 12.036
\]

Similarly, $n\pi_2(\hat\mu) = 19.534$, $n\pi_3(\hat\mu) = 15.852$, and $n\pi_4(\hat\mu) = 8.576$, from which the last count is
$n\pi_5(\hat\mu) = 61 - [12.036 + \cdots + 8.576] = 5.002$. Notice that, as planned, all of the estimated
expected cell counts are ≥ 5, as required for the accuracy of chi-squared tests. Then

\[
\chi^2 = \frac{(44 - 12.036)^2}{12.036} + \cdots + \frac{(9 - 5.002)^2}{5.002} = 117.938
\]

Since m = 1 and k = 5, at level .05 we need $\chi^2_{.05,5-1-1} = \chi^2_{.05,3} = 7.815$. Because 117.938 > 7.815,
we strongly reject $H_0$ at the .05 significance level (in fact, with such a ridiculously large test statistic
value, $H_0$ is rejected at any reasonable $\alpha$).

The largest contributor to the $\chi^2$ statistic here is the number of 0’s in the sample: 44 were observed,
but under a Poisson model only 12.036 are expected. This excess of zeros often occurs with “count”
data, and statisticians have developed zero-inflated versions of the Poisson and other distributions to
accommodate this reality.
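
The full-sample approach of Example 13.9 is easy to script. The sketch below rebuilds the sum of the observations from the cell counts and the nine listed tail values, then forms the estimated expected counts and the test statistic. The helper `chi2_sf` is our own illustrative function for the chi-squared upper-tail area at integer df.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(X >= x) for a chi-squared rv with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        tail, nu = math.erfc(math.sqrt(half)), 1
    else:
        tail, nu = math.exp(-half), 2
    while nu < df:
        tail += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return tail

counts = [44, 2, 5, 1, 9]                        # cells 0, 1, 2, 3, >= 4
tail_values = [4, 5, 6, 6, 7, 11, 13, 15, 17]    # the nine observations >= 4
n = sum(counts)

# Full-sample Poisson MLE is the sample mean
total = sum(x * c for x, c in zip(range(4), counts[:4])) + sum(tail_values)
mu_hat = total / n

probs = [math.exp(-mu_hat) * mu_hat**x / math.factorial(x) for x in range(4)]
probs.append(1.0 - sum(probs))                   # P(X >= 4)
expected = [n * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
df = len(counts) - 1 - 1                         # k - 1 - m with m = 1
p_value = chi2_sf(chi2, df)
print(round(mu_hat, 3), [round(e, 3) for e in expected], round(chi2, 2))
```

Printing the individual terms of the sum is a quick way to confirm that the first cell (the zeros) dominates the statistic.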


When model parameters are estimated using full-sample maximum likelihood, it is known that the
true level $\alpha$ critical value falls between $\chi^2_{\alpha,k-1-m}$ and $\chi^2_{\alpha,k-1}$. So, applying Fisher’s chi-squared method
to situations such as Example 13.9 will occasionally lead to incorrectly rejecting $H_0$, though in
practice this is uncommon. Sometimes even the maximum likelihood estimates based on the full
sample are quite difficult to compute. This is the case, for example, for the two-parameter generalized
negative binomial distribution (Exercise 17). In such situations, method-of-moments estimates are
often used and the resulting $\chi^2$ value compared to $\chi^2_{\alpha,k-1-m}$, although it is not known to what extent
the use of moments estimators affects the true critical value.

In theory, the chi-squared test can also be used to test whether a sample comes from a specified 
family of continuous distributions, such as the exponential family or the normal family. However, 
when the parameter values are not specified, the goodness-of-fit test is rarely used for this purpose. 
Instead, practitioners use one or more of the test procedures mentioned in connection with probability
plots in Section 4.6, such as the Shapiro–Wilk or Anderson–Darling test. For example, the
Shapiro–Wilk procedure tests the hypotheses


$H_0$: the population from which the sample was drawn is normal
$H_a$: $H_0$ is not true


A P-value is calculated based on a test statistic similar to the correlation coefficient associated with 
the points in a normal probability plot. These procedures are generally considered superior to the chi- 
squared tests of this section for continuous families; in particular, they do not rely on creating 
arbitrary class intervals. 


More on the Goodness-of-Fit Test 

When one or more expected counts are less than 5, the chi-squared distribution does not necessarily
accurately approximate the sampling distribution of the test statistic (13.1). This can occur either
because the sample size n is small or because one of the hypothesized proportions $p_{i0}$ is small. If a
larger sample is not available, a common practice is to sensibly “merge” some of the categories with 
small expected counts, so that the expected count for the merged category is larger. This might occur, 
for example, if we examined the political party affiliation of a sample of graduate students, catego- 
rized as Republican, Democrat, Libertarian, Green, Peace and Freedom, and Independent. The counts 
for the nonmajor parties might be quite small, in which case we could combine them to form, say, 
three categories: Republican, Democrat, and Other. The downside of this technique is that infor- 
mation has been discarded—two students belonging to different political parties (e.g., Libertarian and 
Green) are no longer distinguishable.

Although the chi-squared test was developed to handle situations in which k > 2, it can also be
used when k = 2. The null hypothesis in this case can be stated as $H_0\colon p_1 = p_{10}$, since the relations
$p_2 = 1 - p_1$ and $p_{20} = 1 - p_{10}$ make the inclusion of $p_2 = p_{20}$ in $H_0$ redundant. The alternative
hypothesis is $H_a\colon p_1 \ne p_{10}$. These hypotheses can also be tested using a two-tailed one-proportion
z test with test statistic

\[
Z = \frac{(N_1/n) - p_{10}}{\sqrt{\dfrac{p_{10}(1 - p_{10})}{n}}} = \frac{\hat P_1 - p_{10}}{\sqrt{\dfrac{p_{10}p_{20}}{n}}}
\]

In fact, the two test procedures are completely equivalent. This is because it can be shown that
$Z^2 = \chi^2$ (see Exercise 12) and $z_{\alpha/2}^2 = \chi^2_{\alpha,1}$, so that $\chi^2 \ge \chi^2_{\alpha,1}$ if and only if $|Z| \ge z_{\alpha/2}$.¹ In other words,
the two-tailed z test from Chapter 9 rejects $H_0$ if and only if the chi-squared goodness-of-fit test does.
However, if the alternative hypothesis is either $H_a\colon p_1 > p_{10}$ or $H_a\colon p_1 < p_{10}$, the chi-squared test
cannot be used. One must then revert to an upper- or lower-tailed z test.
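
The equivalence for k = 2 is easy to confirm numerically. In the sketch below, the sample size, the observed count, and the null value are hypothetical numbers chosen only for illustration; the check itself works for any valid choice.

```python
import math

# Hypothetical data: n trials, n1 successes, null value p10
n, n1, p10 = 200, 88, 0.4
n2, p20 = n - n1, 1 - p10

# One-proportion z statistic
z = (n1 / n - p10) / math.sqrt(p10 * p20 / n)

# Pearson chi-squared statistic with k = 2 cells
chi2 = (n1 - n * p10) ** 2 / (n * p10) + (n2 - n * p20) ** 2 / (n * p20)

print(z**2, chi2)   # the two values agree up to floating-point rounding
```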

As is the case with all test procedures, one must be careful not to confuse statistical significance
with practical significance. A computed $\chi^2$ that leads to the rejection of $H_0$ may be a result of a very
large sample size rather than any practical differences between the hypothesized $p_{i0}$’s and true $p_i$’s.
Thus if $p_{10} = p_{20} = p_{30} = \frac{1}{3}$, but the true $p_i$’s have values .330, .340, and .330, a large value of $\chi^2$ is
sure to arise with a sufficiently large n. Before rejecting $H_0$, the $\hat p_i$’s should be examined to see
whether they suggest a model different from that of $H_0$ from a practical point of view.
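
To see how sample size alone can drive the statistic, the following sketch evaluates the chi-squared formula at idealized counts equal to n times the true proportions .330, .340, .330. This is a deliberate simplification (real samples would add random variation), and the three sample sizes are arbitrary choices of ours. The critical value uses the fact that for 2 df the upper-tail area at x is exp(-x/2).

```python
import math

p_true = [0.330, 0.340, 0.330]      # true proportions from the text
p_null = [1 / 3] * 3                # hypothesized proportions

def idealized_chi2(n):
    """Chi-squared statistic when every observed count equals n * (true p_i)."""
    return sum((n * pt - n * p0) ** 2 / (n * p0) for pt, p0 in zip(p_true, p_null))

critical = -2 * math.log(0.05)      # chi-squared critical value at level .05, df = 2

for n in (1_000, 10_000, 100_000):
    stat = idealized_chi2(n)
    print(n, round(stat, 2), "reject" if stat >= critical else "do not reject")
```

The statistic grows in direct proportion to n, so the same practically negligible departure from the null is eventually declared significant.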


Exercises: Section 13.1 (1-18) 


1. What conclusion would be appropriate for
an upper-tailed chi-squared test in each of
the following situations?

a. $\alpha$ = .05, df = 4, $\chi^2$ = 12.25
b. $\alpha$ = .01, df = 3, $\chi^2$ = 8.54
c. $\alpha$ = .10, df = 2, $\chi^2$ = 4.36
d. $\alpha$ = .01, k = 6, $\chi^2$ = 10.20


2. Say as much as you can about the P-value
for an upper-tailed chi-squared test in each
of the following situations:

a. $\chi^2$ = 7.5, df = 2
b. $\chi^2$ = 13.0, df = 6
c. $\chi^2$ = 18.0, df = 9
d. $\chi^2$ = 21.3, k = 5
e. $\chi^2$ = 5.0, k = 4


3. A statistics department at a large university
maintains a tutoring center for students in its
introductory service courses. The center has
been staffed with the expectation that 40% of
its clients would be from the business statistics
course, 30% from engineering statistics, 20%
from the statistics course for social science
students, and the other 10% from the course
for agriculture students. A random sample of
n = 120 clients revealed 52, 38, 21, and 9
from the four courses. Does this data suggest
that the percentages on which staffing was
based are not correct? State and test the
relevant hypotheses using $\alpha$ = .05.


4. It is hypothesized that when homing
pigeons are disoriented in a certain manner,
they will exhibit no preference for any
direction of flight after takeoff (so that the
direction X should be uniformly distributed
on the interval from 0° to 360°). To test
this, 120 pigeons were disoriented, let
loose, and the direction of flight of each
was recorded; the resulting data follows.
Use the chi-squared test at level .10 to see
whether the data supports the hypothesis.

Direction    0–<45°      45–<90°     90–<135°    135–<180°
Frequency    12          16          17          15

Direction    180–<225°   225–<270°   270–<315°   315–<360°
Frequency    13          20          17          10


¹ The fact that $z_{\alpha/2}^2 = \chi^2_{\alpha,1}$ is a consequence of the relationship between the standard normal distribution and the
chi-squared distribution with 1 df: if Z ~ N(0, 1), then by definition $Z^2$ has a chi-squared distribution with $\nu$ = 1. See
Section 6.3.


5. An information retrieval system has ten storage
locations. Information has been stored
with the expectation that the long-run proportion
of requests for location i is given by
the expression $p_i = (5.5 - |i - 5.5|)/30$.
A sample of 200 retrieval requests gave the
following frequencies for locations 1-10,
respectively: 4, 15, 23, 25, 38, 31, 32, 14, 10,
and 8. Use a chi-squared test at significance
level .10 to decide whether the data is consistent
with the a priori proportions (use the
P-value approach).


6. The article “Racial Stereotypes in Children’s
Television Commercials” (J. Adver.
Res. 2008) reported the following frequencies
with which characters of different
ethnicities appeared in recorded commercials
aired on Philadelphia television
stations.

Ethnicity    African-American    Asian    Caucasian    Hispanic
Frequency    57                  11       330          6

Census data at the time reported that the population
proportions for these four ethnic
groups were .177, .032, .734, and .057,
respectively. Does the data suggest that the
true proportions in commercials are different
from the census proportions? Carry out
a test of appropriate hypotheses using a
significance level of .01.


7. A retail bookstore manager is re-evaluating 
weekday staffing by looking at recent sales. 
The accompanying table summarizes a 
sample of 92 recent weekday sales. 


Weekday            Monday    Tuesday    Wednesday    Thursday    Friday
Number of sales    22        13         16           17          24


Assuming these sales are representative of
all weekday sales at the bookstore, does the
data indicate that such sales are not evenly
distributed throughout the week?


8. Benford’s Law, introduced in Chapter 3
Exercise 19, postulates that the lead digits
(1, 2, ..., 9) in a large data set should follow
the rule

\[
P(\text{lead digit is } d) = \log_{10}\left(\frac{d+1}{d}\right)
\]

(So, for example, the proportion of
numbers with a leading 1 is predicted to
be $\log_{10}(2) \approx .3$, and the probabilities
decrease as d increases.) The author of
the article “Benford’s Law Applies to
Online Social Networks” (PLoS One
2015) used Twitter’s API to randomly
sample 78,226 Twitter users and record
the number of followers each person has.
The lead digits of those counts are
summarized below.

Lead digit    1        2        3      4      5      6      7      8      9
Frequency     26,286   14,395   9923   7246   5737   4641   3834   3348   2816

Does the data indicate that the lead digits of
the variable “number of Twitter followers”
indeed conform to Benford’s Law? Test at
the .05 significance level.


9. The response time of a computer system to
a request for a certain type of information is
hypothesized to have an exponential distribution
with parameter $\lambda$ = 1 [so if
X = response time, the pdf of X under $H_0$ is
$f_0(x) = e^{-x}$ for $x \ge 0$].

a. If you had observed $X_1, X_2, \ldots, X_n$ and
wanted to use the chi-squared test with
five class intervals having equal probability
under $H_0$, what would be the
resulting class intervals?

b. Carry out the chi-squared test using the 
following data resulting from a random 
sample of 40 response times: 


10 = .99 1.14 1.26 3.24 12 .26 .80 
79 1.16 1.76 41 59 27° 2.22 .66 
7102.21 ©6668) ©6431 46 698 
91 55 81 251 2.77 16 1.11 .02 
2.13 19 1.21 1.13 2.93 2.14 34 44


10. a. Show that another expression for the
chi-squared statistic (13.1) is

\[
\chi^2 = \sum_{i=1}^{k} \frac{N_i^2}{np_{i0}} - n
\]

Why is it more efficient to compute $\chi^2$
using this formula?

b. When the null hypothesis is $H_0\colon p_1 = \cdots = p_k = 1/k$
(i.e., $p_{i0} = 1/k$ for all i),
how does the formula of part (a) simplify?
Use the simplified expression to
calculate $\chi^2$ for the pigeon/direction data
in Exercise 4.


11. a. Having obtained a random sample from
a population, you wish to use a chi-squared
test to decide whether the population
distribution is standard normal.
If you base the test on six class intervals
having equal probability under $H_0$, what
should the class intervals be?

b. If you wish to use a chi-squared test to
test $H_0$: the population distribution is
normal with $\mu$ = .5, $\sigma$ = .002 and the
test is to be based on six equiprobable
(under $H_0$) class intervals, what should
these intervals be?

c. Use the chi-squared test with the intervals
of part (b) to decide, based on the
following 45 bolt diameters, whether
bolt diameter is a normally distributed
variable with $\mu$ = .5 in., $\sigma$ = .002 in.


4974 .4976 .4991 5014 5008 4993 
4994 5010 .4997 .4993 .5013 5000 
017.4984 =.4967) 5028) =.4975_—.5013 
4972, 5047 = .5069 =.4977 4961 4987 
4990 .4974 =.5008 =.5000 =.4967_ 4977 
4992 5007 =.4975 = 4998 = 5000 5008 
5021 4959 5015. 5012) .5056__—~«4991 
5006 .4987 4968 


12. Let $p_1$ denote the proportion of successes in
a particular population. The test statistic
value in Chapter 9 for testing $H_0\colon p_1 = p_{10}$
was $z = (\hat p_1 - p_{10})/\sqrt{p_{10}p_{20}/n}$, where
$p_{20} = 1 - p_{10}$. Show that for the case k = 2,
Pearson’s chi-squared statistic value
satisfies $\chi^2 = z^2$. [Hint: First show that
$(n_1 - np_{10})^2 = (n_2 - np_{20})^2$.]

13. Consider a large population of families in
which each family has exactly three chil- 
dren. If the sexes of the three children in 
any family are independent of one another, 
the number of male children in a randomly 
selected family will have a binomial dis- 
tribution based on three trials. 


a. Suppose a random sample of 160 fami- 
lies yields the following results. Test the 
relevant hypotheses by proceeding as in 
Example 13.5. 


Number of Male Children 0 1 2 3 
Frequency 14 66 64 16 


b. Suppose a random sample of families 
resulted in observed frequencies of 15, 
20, 12, and 3, respectively. Would the 
chi-squared test be based on the same 
number of degrees of freedom as the test 
in part (a)? Explain. 


14. A certain type of flashlight is sold with the
four batteries included. A random sample
of 150 flashlights is obtained, and the
number of defective batteries in each is
determined, resulting in the following data:

Number defective    0     1     2     3     4
Frequency           26    51    47    16    10

Let X be the number of defective batteries in
a randomly selected flashlight. Test the null
hypothesis that the distribution of X is
Bin(4, $\theta$). That is, with $p_i$ = P(i defectives),
test

\[
H_0\colon p_i = \binom{4}{i}\theta^i(1-\theta)^{4-i} \qquad i = 0, 1, 2, 3, 4
\]

[Hint: To obtain the mle of $\theta$, write the
multinomial likelihood (the function to be
maximized) as $\theta^u(1-\theta)^v$, where the
exponents u and v are linear functions of
the cell counts. Then take the natural log,
differentiate with respect to $\theta$, equate the
result to 0, and solve for $\hat\theta$.]

15. An article in Annals of Mathematical 
Statistics reports the following data on the 
number of borers (i.e., insects that bore into 
wood) in each of 120 groups of borers. 
Does the Poisson pmf provide a plausible 
model for the distribution of the number of 
borers in a group? [Hint: Add the frequencies
for 7, 8, ..., 12 to establish a
single category “≥ 7.”]

Number of borers    0    1    2    3    4    5    6    7    8    9    10    11    12
Frequency           24   16   16   18   15   9    6    5    3    4    3     0     1


16. Modeling the proliferation of E. coli in 
farm animals is critical to a safe food sup- 
ply. The article “Ecological and Genetic 
Determinants of Plasmid Distribution in 
E. coli” (Environ. Biol. 2016: 4230-4239) 
describes a study of bacterial replication in 
grazing cattle with low frequencies of 
antibiotic-resistant genes. The following 
data is provided on X = number of repli- 
cons for 527 bacterial isolates: 


No. of replicons 0 1 2 3 4 


Frequency 139 184 154 34 16 


a. The article’s authors examined whether
X follows a Poisson distribution. Use the
data to determine the maximum likelihood
estimate of the parameter $\mu$.

b. Perform a $\chi^2$ test at the .05 significance
level by treating the last category as
“≥ 4” so that the hypothesized probabilities
sum to 1. [Hint: Refer back to
Example 13.9.]


17. The following data on X = number of cor- 
rosion defects per segment is consistent 
with data on one of the pipelines described 
in the article “The Negative Binomial 
Distribution as a Model for External Cor- 
rosion Defect Counts in Buried Pipelines” 
(Corr. Sci. 2015: 114-131): 


The authors propose a generalized negative
binomial model for X, which has pmf

\[
nb(x; r, p) = k(r, x) \cdot p^r(1-p)^x \qquad \text{for } x = 0, 1, 2, \ldots
\]

where k(r, 0) = 1 and

\[
k(r, x) = \frac{r(r+1)\cdots(r+x-1)}{x!} \qquad \text{for } x \ge 1
\]

Based on these n = 103 randomly selected
segments, the authors estimate the negative
binomial parameters to be $\hat r = 1.272$ and
$\hat p = .258$. Test the hypothesis that the data
is consistent with a generalized negative
binomial distribution at the .10 significance
level. [Suggestion: To ensure that all
expected counts are ≥ 5, define “cells”
by x = 0, 1, ..., 6, 7-8, and ≥ 9.]

18. Each headlight on an automobile undergoing
an annual vehicle inspection can be
focused either too high (H), too low (L), or
properly (N). Checking the two headlights
simultaneously (and not distinguishing
between left and right) results in the six
possible outcomes HH, LL, NN, HL, HN,
and LN. If the probabilities (population
proportions) for the single headlight focus
direction are P(H) = $\theta_1$, P(L) = $\theta_2$, and
P(N) = $1 - \theta_1 - \theta_2$ and the two headlights
are focused independently of each other,
the probabilities of the six outcomes for a
randomly selected car are the following:

\[
\begin{aligned}
p_1 &= \theta_1^2 & p_2 &= \theta_2^2 & p_3 &= (1 - \theta_1 - \theta_2)^2 \\
p_4 &= 2\theta_1\theta_2 & p_5 &= 2\theta_1(1 - \theta_1 - \theta_2) & p_6 &= 2\theta_2(1 - \theta_1 - \theta_2)
\end{aligned}
\]

Use the accompanying data to test the null
hypothesis

\[
H_0\colon p_1 = \pi_1(\theta_1, \theta_2), \ldots, p_6 = \pi_6(\theta_1, \theta_2)
\]

where the $\pi_i(\theta_1, \theta_2)$’s are given previously.

Outcome      HH    LL    NN    HL    HN    LN
Frequency    49    26    14    20    53    38

[Hint: Write the multinomial likelihood as a
function of $\theta_1$ and $\theta_2$, take the natural log,
then obtain $\partial/\partial\theta_1$ and $\partial/\partial\theta_2$, equate them
to 0, and solve for $\hat\theta_1$, $\hat\theta_2$.]
13.2 Two-Way Contingency Tables 


In the previous section, we discussed inferential methods for a single categorical variable (e.g.,
genotype), as well as for a quantitative variable whose values have been partitioned into disjoint
categories. We now study problems involving two categorical variables. There are two commonly
encountered situations in which such data arises:


1. There are I populations of interest, and each population is divided into the same J categories.
A sample is taken from the ith population (i = 1, ..., I), and the number of individuals in each of
the J categories is recorded. For example, customers of each of I = 3 department store chains
might have available the same J = 5 payment categories: cash, check, credit card, debit card, and
Apple Pay.


2. There is a single population of interest, with each individual in the population categorized with
respect to two different factors. There are I categories associated with the first factor and J categories
associated with the second factor. A single sample is taken, and individuals are “cross-classified”
by the two factors. As an example, customers making a department store purchase
might be classified according to both the department in which the purchase was made (with I = 6
departments) and according to method of payment (with the same J = 5 methods as above).


In both cases (1) and (2), the data can be summarized by reporting the counts for each combination:
(store chain, payment method) for (1) and (department, payment method) for (2). Let $N_{ij}$
denote the number of individuals in the sample(s) falling in the (i, j)th category. A table displaying the
$n_{ij}$’s (observed counts) is called a two-way contingency table; a prototype is shown in Table 13.7.


Table 13.7 A two-way contingency table

        1         2         ...    j         ...    J
1       n_11      n_12      ...    n_1j      ...    n_1J
2       n_21      ...
...
i       n_i1                ...    n_ij      ...
...
I       n_I1                ...                     n_IJ


In situations of the first type, we want to investigate whether the proportions in the different
categories (columns) are the same for all populations (rows). The null hypothesis states that the
I populations are homogeneous with respect to these J categories. In the second situation, we
investigate whether the categories of the two factors occur independently of one another in the
population. It turns out, interestingly, that the two methods of analysis are actually identical (same
calculations, test statistic formula, and null sampling distribution); it doesn’t matter if our two-way
table is the result of stratified sampling (the first case) or simple random sampling (the second case).


Testing for Homogeneity 

The test of homogeneity generalizes the two-proportion z test of Chapter 9 to the comparison of two
or more populations with respect to two or more categories. We assume that each individual in every
one of the I populations belongs in exactly one of J categories. A sample of $n_i$ individuals is taken
from the ith population. Let $n = \sum n_i$, the total sample size, and

\[
\begin{aligned}
N_{ij} &= \text{the number of individuals in the } i\text{th sample who fall into category } j \\
N_{\cdot j} &= \sum_{i=1}^{I} N_{ij} = \text{the total number of individuals among the } n \text{ sampled who fall into category } j
\end{aligned}
\]

As before, upper-case letters denote rvs and lower-case letters the observed values. The $n_{ij}$’s are
recorded in a contingency table with I rows and J columns (Table 13.7). The sum of the $n_{ij}$’s in the ith
row is $n_i$, whereas the sum of the entries in the jth column is $n_{\cdot j}$.

Let

\[
p_{ij} = \text{the proportion of the individuals in population } i \text{ who fall into category } j
\]

Thus, for population 1, the J proportions are $p_{11}, p_{12}, \ldots, p_{1J}$ (which sum to 1) and similarly for the
other populations. The null hypothesis of homogeneity states that the proportion of individuals in
category j is the same for each population and that this is true for every category; that is, for every j,
$p_{1j} = p_{2j} = \cdots = p_{Ij}$.

When $H_0$ is true, we can use $p_1, p_2, \ldots, p_J$ to denote the population proportions in the J different
categories; these proportions are common to all I populations. The expected number of individuals
in the ith sample who fall in the jth category when $H_0$ is true is then $E(N_{ij}) = n_i \cdot p_j$. To estimate
$E(N_{ij})$, we must first estimate $p_j$, the proportion in category j. Among the total sample of n individuals,
$N_{\cdot j}$ fall into category j, so we use $\hat P_j = N_{\cdot j}/n$ as the estimator (this is the maximum
likelihood estimator of $p_j$). Substitution of $\hat P_j$ for $p_j$ in $n_i \cdot p_j$ yields a simple formula for estimated
expected counts under $H_0$:

\[
\hat E_{ij} = \text{estimator of the expected count in cell } (i, j) = n_i \cdot \frac{N_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}
\tag{13.7}
\]

The test statistic will have the same form (13.1) as in previous chi-squared tests. The number of
degrees of freedom comes from the general $\chi^2$ df rule of the previous section. In each row of
Table 13.7 there are $J - 1$ freely determined cell counts (each sample size $n_i$ is fixed), so there are a
total of $I(J - 1)$ freely determined cells. Parameters $p_1, \ldots, p_J$ are estimated, but because $\sum p_j = 1$,
only $J - 1$ of these are independently determined. Thus df $= I(J - 1) - (J - 1) = (I - 1)(J - 1)$.
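
Expression (13.7) and the df count above are easy to check numerically. The following is a minimal Python sketch; the 2 x 3 table of counts and the function names are illustrative assumptions, not data or notation from the text.

```python
def expected_counts(table):
    """Estimated expected counts: (row total)(column total) / n for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def homogeneity_df(num_rows, num_cols):
    """I(J - 1) freely determined cells minus J - 1 estimated parameters."""
    return num_rows * (num_cols - 1) - (num_cols - 1)


# Hypothetical counts: two samples of 100 individuals, three categories.
hypothetical_table = [[20, 30, 50],
                      [30, 30, 40]]

print(expected_counts(hypothetical_table))
print(homogeneity_df(2, 3))  # same value as (I - 1)(J - 1)
```

Because both hypothetical samples have the same size, each row receives the same estimated expected counts, which is what homogeneity would predict.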
CHI-SQUARED TEST OF HOMOGENEITY

Null hypothesis: $H_0$: $p_{1j} = p_{2j} = \cdots = p_{Ij}$, $j = 1, 2, \ldots, J$
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value:

$$\chi^2 = \sum_{\text{all cells}} \frac{(\text{observed} - \text{estimated expected})^2}{\text{estimated expected}} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$$

Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
P-value: area under the $\chi^2_{(I-1)(J-1)}$ curve to the right of $\chi^2$

Estimated expected counts are calculated using Expression (13.7). The test
can safely be applied as long as all estimated expected counts are at least 5.


Example 13.10 A company packages a particular product in cans of three different sizes, each one 
using a different production line. Most cans conform to specifications, but a quality control engineer 
has identified the following reasons for nonconformance: (1) blemish on can; (2) crack in can; 
(3) improper pull tab location; (4) pull tab missing; (5) other. A sample of nonconforming units is 
selected from each of the three lines, and each unit is categorized according to reason for noncon- 
formity, resulting in the following contingency table data: 


                          Reason for Nonconformity

Production line   Blemish   Crack   Location   Missing   Other   $n_i$
1                    34       65       17         21       13     150
2                    23       52       25         19        6     125
3                    32       28       16         14       10     100
Total                89      145       58         54       29     375


Does the data suggest that the proportions falling in various nonconformance categories are not the 
same for the three lines? The parameters of interest are various proportions, and the relevant 
hypotheses are 


$H_0$: the production lines are homogeneous with respect to the five nonconformance categories; that is,
$p_{1j} = p_{2j} = p_{3j}$ for $j = 1, \ldots, 5$

$H_a$: the production lines are not homogeneous with respect to the categories


The estimated expected frequencies (assuming homogeneity) must now be calculated using (13.7). 
Consider the first nonconformance category for the first production line. When the lines are homo- 
geneous, the estimated expected number among the 150 selected units that are blemished is 


$$\hat{e}_{11} = \frac{(\text{first row total})(\text{first column total})}{\text{total of sample sizes}} = \frac{(150)(89)}{375} = 35.60$$

The contribution of the cell in the upper-left corner to $\chi^2$ is then

$$\frac{(\text{observed} - \text{estimated expected})^2}{\text{estimated expected}} = \frac{(34 - 35.60)^2}{35.60} = .072$$


The other contributions are calculated in a similar manner. Figure 13.2 shows Minitab output for the
chi-squared test. The observed count is the top number in each cell, and directly below it is the
estimated expected count. The contribution of each cell to $\chi^2$ appears below the counts, and the test
statistic value is $\chi^2 = 14.159$. All estimated expected counts are at least 5, so merging categories is
unnecessary. The test is based on (3 - 1)(5 - 1) = 8 df. Appendix Table A.10 shows that the values
that capture upper-tail areas of .08 and .075 under the 8 df curve are 14.06 and 14.26, respectively.
Thus the P-value is between .075 and .08; Minitab gives P-value = .079. The null hypothesis of
homogeneity should not be rejected at the usual significance levels of .05 or .01, but it would be
rejected for the higher $\alpha$ of .10.


Expected counts are printed below observed counts


         blem    crack     loc   missing   other   Total
1          34       65      17        21      13     150
        35.60    58.00   23.20     21.60   11.60
2          23       52      25        19       6     125
        29.67    48.33   19.33     18.00    9.67
3          32       28      16        14      10     100
        23.73    38.67   15.47     14.40    7.73
Total      89      145      58        54      29     375
Chisq = 0.072 + 0.845 + 1.657 + 0.017 + 0.169 + 1.498 + 0.278 +
        1.661 + 0.056 + 1.391 + 2.879 + 2.943 + 0.018 + 0.011 +
        0.664 = 14.159
df = 8, p = 0.079


Figure 13.2 Minitab output for the chi-squared test of Example 13.10
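
The calculation behind this output can be retraced with a short Python sketch. It uses only the counts of Example 13.10 and Expression (13.7); the function names are illustrative assumptions, and the upper-tail formula used is the standard closed form for a chi-squared distribution with an even number of degrees of freedom.

```python
import math

# Observed counts from Example 13.10 (rows: production lines 1-3;
# columns: blemish, crack, location, missing, other).
observed = [
    [34, 65, 17, 21, 13],
    [23, 52, 25, 19, 6],
    [32, 28, 16, 14, 10],
]


def chi_squared_statistic(table):
    """Return (chi-squared value, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # Expression (13.7)
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of the chi-squared curve; valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


stat, df = chi_squared_statistic(observed)
p_value = chi2_upper_tail_even_df(stat, df)
print(round(stat, 3), df)
```

The computed P-value falls between .075 and .08, in agreement with the bounds read from Appendix Table A.10.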


It’s worth exploring the specific differences (i.e., lack of homogeneity) indicated by the $\chi^2$ test.
The segmented bar chart in Figure 13.3 displays the distribution of nonconformances for each of 
the three production lines. Line 2 appears to have a higher proportion of improper pull tab locations 
than the other two lines, while Line 3 has a disproportionately large number of cans with 
blemishes. 
Figure 13.3 Segmented bar chart for Example 13.10 (vertical axis: percent within production line, 0 to 100; bars: Line 1, Line 2, Line 3; segments: Blemish, Crack, Location, Missing, Other) ■


When $I = 2$ and $J = 2$ (two populations and two categories), data in a two-way table may also be
analyzed using the two-proportion z procedure of Chapter 9; we associate $j = 1$ with “success” and
$j = 2$ with “failure.” In this case, the chi-squared test of homogeneity is equivalent to the z test of
$H_0$: $p_1 = p_2$ versus $H_a$: $p_1 \ne p_2$: the test statistic values are related by $z^2 = \chi^2$, and the P-values will
be identical. The two-proportion z test allows us to consider one-sided alternatives ($p_1 > p_2$ and
$p_1 < p_2$), while the chi-squared test does not. The benefit of the chi-squared test of homogeneity is
that we may compare more than two populations and/or consider a response variable with more than
two categories, as we did in Example 13.10.
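
The relationship $z^2 = \chi^2$ can be illustrated numerically. The sketch below is in Python; the 2 x 2 counts are hypothetical (chosen only for illustration), the function names are assumptions, and the z statistic is taken to be the pooled-proportion form used for testing $H_0$: $p_1 = p_2$.

```python
import math

# Hypothetical data: each row is one sample, columns are (successes, failures).
table = [[30, 70],
         [45, 55]]


def pooled_two_proportion_z(tbl):
    """z statistic for H0: p1 = p2 using the pooled sample proportion."""
    (x1, f1), (x2, f2) = tbl
    n1, n2 = x1 + f1, x2 + f2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / std_error


def chi_squared_statistic(tbl):
    """Chi-squared statistic with expected counts (row total)(col total)/n."""
    row_totals = [sum(row) for row in tbl]
    col_totals = [sum(col) for col in zip(*tbl)]
    n = sum(row_totals)
    return sum(
        (tbl[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(tbl))
        for j in range(len(tbl[0]))
    )


z = pooled_two_proportion_z(table)
chi2 = chi_squared_statistic(table)

# Two-sided normal tail area and chi-squared (1 df) upper-tail area.
p_from_z = math.erfc(abs(z) / math.sqrt(2))
p_from_chi2 = math.erfc(math.sqrt(chi2 / 2))
print(z ** 2, chi2)
```

For these hypothetical counts the squared z statistic and the chi-squared statistic agree up to floating-point rounding, as do the two P-values.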


Testing for Independence 

We focus now on the relationship between two different factors in a single population. The number of
categories of the first factor will be denoted by $I$ and the number of categories of the second factor by
$J$. Each individual in the population is assumed to belong in exactly one of the $I$ categories associated
with the first factor and exactly one of the $J$ categories associated with the second factor. For example,
the population of interest might consist of all individuals who regularly watch the national news on
television, with the first factor being preferred network (ABC, CBS, NBC, PBS, CNN, Fox News, or
MSNBC, so $I = 7$) and the second factor political views (liberal, moderate, conservative, giving
$J = 3$).

For a sample of $n$ individuals taken from the population, let $N_{ij}$ denote the number among the $n$ who
fall into the $(i, j)$th category pair. The observed $n_{ij}$'s can be displayed in a two-way contingency table
like Table 13.7. In the case of homogeneity for $I$ populations, the row totals were fixed in advance, and
only the $J$ column totals were random. Now only the total sample size is fixed, and both the $N_{i\cdot}$'s
(row totals) and $N_{\cdot j}$'s (column totals) are random variables. To state the hypotheses of interest, let

$p_{ij}$ = the proportion of individuals in the population who belong in category $i$ of factor 1 and category $j$ of factor 2
       = P(a randomly selected individual falls in both category $i$ of factor 1 and category $j$ of factor 2)

Then

$$p_{i\cdot} = \sum_{j=1}^{J} p_{ij} = P(\text{a randomly selected individual falls in category } i \text{ of factor 1})$$

$$p_{\cdot j} = \sum_{i=1}^{I} p_{ij} = P(\text{a randomly selected individual falls in category } j \text{ of factor 2})$$


The null hypothesis of independence says that an individual’s category with respect to factor 1 is
independent of the category with respect to factor 2. Recall that two events $A$ and $B$ are independent if
$P(A \cap B) = P(A) \cdot P(B)$; using the above notation, this becomes $p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$ for every pair $(i, j)$.

The expected count in cell $(i, j)$ is $n \cdot p_{ij}$, so when $H_0$ is true, $E(N_{ij}) = n \cdot p_{i\cdot} \cdot p_{\cdot j}$. To obtain a
chi-squared statistic, we must therefore estimate the $p_{i\cdot}$'s $(i = 1, \ldots, I)$ and $p_{\cdot j}$'s $(j = 1, \ldots, J)$. The
maximum likelihood estimators are

$$\hat{P}_{i\cdot} = \frac{N_{i\cdot}}{n} = \text{sample proportion for category } i \text{ of factor 1}$$

and

$$\hat{P}_{\cdot j} = \frac{N_{\cdot j}}{n} = \text{sample proportion for category } j \text{ of factor 2}$$

This gives estimated expected cell counts identical to those in the case of homogeneity:

$$\hat{E}_{ij} = n \cdot \hat{P}_{i\cdot} \cdot \hat{P}_{\cdot j} = n \cdot \frac{N_{i\cdot}}{n} \cdot \frac{N_{\cdot j}}{n} = \frac{N_{i\cdot} N_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n} \qquad (13.8)$$


Thus the test statistic is also identical to that used in testing for homogeneity. Perhaps surprisingly, so is
the number of degrees of freedom! This is because the number of freely determined cell counts is
$IJ - 1$, since only the total $n$ is fixed in advance. There are $I$ estimated $p_{i\cdot}$'s, but only $I - 1$ are
independently estimated since $\sum p_{i\cdot} = 1$, and similarly $J - 1$ $p_{\cdot j}$'s are independently estimated, so
$I + J - 2$ parameters are independently estimated. The df rule now yields df $= (IJ - 1) - (I + J - 2) =
IJ - I - J + 1 = (I - 1)(J - 1)$, identical to the df for the test of homogeneity.


CHI-SQUARED TEST OF INDEPENDENCE

Null hypothesis: $H_0$: $p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$, $i = 1, \ldots, I$; $j = 1, \ldots, J$
Alternative hypothesis: $H_a$: $H_0$ is not true
Test statistic value:

$$\chi^2 = \sum_{\text{all cells}} \frac{(\text{observed} - \text{estimated expected})^2}{\text{estimated expected}} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$$

Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$
P-value: area under the $\chi^2_{(I-1)(J-1)}$ curve to the right of $\chi^2$

Estimated expected counts are calculated using Expression (13.8). The test
can safely be applied as long as all estimated expected counts are at least 5.
Example 13.11 Do faculty perceive lack of diversity as a problem at their universities? Each
individual in a survey of 1312 accounting faculty members across the USA was asked, “Do you think 
educational institutions need to improve diversity among their faculty?” (Issues Account. Educ. 
2007). Respondents were then classified into three categories: Caucasian men, Caucasian women, and 
minorities. Observed and estimated expected counts (in parentheses) are given in Table 13.8. The 
estimated expected counts were calculated using Expression (13.8). 


Table 13.8 Observed and estimated expected counts for Example 13.11


Response       White Men       White Women     Minority
Yes            355 (411.70)    279 (249.91)    158 (130.39)
No             310 (255.23)    129 (154.93)     52 (80.84)
No response     17 (15.07)       6 (9.15)        6 (4.77)


All but one estimated expected count is at least 5; the value $\hat{e}_{33} = 4.77$ is close enough to 5 that a $\chi^2$
analysis will still be accurate. In words, the hypotheses being tested for the population of all
accounting faculty members in the USA are


$H_0$: diversity attitude and race/sex classification are independent
$H_a$: diversity attitude and race/sex classification are not independent


From Table 13.8, the test statistic value is

$$\chi^2 = \frac{(355 - 411.70)^2}{411.70} + \cdots + \frac{(6 - 4.77)^2}{4.77} = 45.065$$

and because $45.065 > \chi^2_{.01,(3-1)(3-1)} = \chi^2_{.01,4} = 13.277$, the hypothesis of independence is rejected at
the .01 significance level. (The P-value is 0 to several decimal places.) The data suggests that a
faculty member’s attitude toward diversity is not independent of race/sex.

A segmented bar chart (Figure 13.4), along with the observed and expected counts, allows us to
explore further. We see that Caucasian males were much more likely than expected to say diversity
doesn’t need to be improved (observed No’s = 310, expected No’s = 255.23), while minority faculty
said “Yes” to the question of diversity improvement more often than expected if $H_0$ were true
(observed Yes’s = 158, expected = 130.39). Caucasian women’s responses were somewhere in
between. ■


Figure 13.4 Segmented bar chart for Example 13.11 (vertical axis: percent within category, 0 to 100; bars: White men, White women, Minority; segments: Yes, No, Not Sure)
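
The numbers in Example 13.11 can be retraced with a short Python sketch based on Expression (13.8). The function name and the use of the closed-form tail area for 4 df are illustrative choices; the critical value 13.277 is taken from the example itself.

```python
import math

# Observed counts from Table 13.8 (rows: Yes, No, No response;
# columns: White men, White women, Minority).
observed = [
    [355, 279, 158],
    [310, 129, 52],
    [17, 6, 6],
]


def expected_and_statistic(table):
    """Return (estimated expected counts, chi-squared value, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, stat, df


expected, stat, df = expected_and_statistic(observed)
smallest_expected = min(min(row) for row in expected)

# For 4 df the upper-tail area has the closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-stat / 2) * (1 + stat / 2)

critical_value = 13.277  # chi-squared critical value for alpha = .01, 4 df (from the text)
print(round(stat, 3), df, stat > critical_value)
```

The smallest estimated expected count is the one for the (No response, Minority) cell, matching the value 4.77 discussed above.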


Ordinal Factors and Logistic Regression 

Sometimes a factor has ordinal categories, meaning that there is a natural ordering. For example, 
there is a natural ordering to freshman, sophomore, junior, senior. In such situations we can use a 
method that often has greater power to detect relationships by adapting the logistic regression model 
of Chapter 12. 

Consider the case in which the first (row) factor is ordinal and the other (column) factor has two 
categories. Denote by X the level of the ordinal factor, which will be the predictor in the model. Let 
Y designate the column, so Y will be the response variable in the model. It is convenient for purposes 
of logistic regression to label the two columns as Y = 0 (failure, j = 1) and Y = 1 (success, j = 2), 
corresponding to the usual notation for Bernoulli trials. In terms of logistic regression, p(x) is the 
probability of success given that X = x: 


$$p(x) = P(Y = 1 \mid X = x) = P(j = 2 \mid i = x) = \frac{p_{x2}}{p_{x1} + p_{x2}}$$

Then the logistic model of Chapter 12 says that there exist parameters $\beta_0$, $\beta_1$ satisfying

$$e^{\beta_0 + \beta_1 x} = \frac{p(x)}{1 - p(x)} = \frac{p_{x2}}{p_{x1}}$$


In terms of the odds of success in a row (estimated by the ratio of the two counts), the model says that
the odds change proportionally (by the fixed multiple $e^{\beta_1}$, the odds ratio) from row to row. For
example, suppose a test is given in grades 1, 2, 3, and 4 with successes and failures as follows:


Grade   Failed   Passed   Estimated odds
1         45       45           1
2         30       60           2
3         18       72           4
4         10       80           8


Here the model fits perfectly, with odds ratio $e^{\beta_1} = 2$, so $\beta_1 = \ln(2)$ and $\beta_0 = -\ln(2)$. If a table with
$I$ rows and 2 columns has roughly a common odds ratio from row to row, then the logistic model
should be a good fit if the rows are labeled with consecutive integers.
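
The perfect fit in the grades table can be confirmed with a few lines of Python. This is only an arithmetic check of the values stated above; the variable names are illustrative.

```python
import math

grades = [1, 2, 3, 4]
failed = [45, 30, 18, 10]
passed = [45, 60, 72, 80]

# Estimated odds of passing in each grade: ratio of the two counts in the row.
estimated_odds = [p / f for p, f in zip(passed, failed)]

# Row-to-row odds ratios; a common value supports the logistic model.
odds_ratios = [estimated_odds[k + 1] / estimated_odds[k] for k in range(len(grades) - 1)]

# Parameter values stated in the text.
beta1 = math.log(2)
beta0 = -math.log(2)
model_odds = [math.exp(beta0 + beta1 * x) for x in grades]

print(estimated_odds)
print(odds_ratios)
```

With these parameter values the model odds $e^{\beta_0 + \beta_1 x}$ reproduce the estimated odds in every row.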

We focus on $\beta_1$ because the relationship between the two variables hinges on this parameter. The
hypothesis of no relationship is equivalent to $H_0$: $\beta_1 = 0$, which is usually tested against a two-tailed
alternative.


Example 13.12 Is there a relationship between TV watching and physical fitness? For an answer we 
refer to the article “Television Viewing and Physical Fitness in Adults” (Res. Quart. Exerc. Sport 
1990: 315-320). Subjects were asked about their TV viewing habits and were classified as physically 
fit if they scored in the excellent or very good category on a step test. Table 13.9 shows the results in
the form of a 4 x 2 table. 


Table 13.9 TV versus fitness results


         TV Time   Unfit   Fit
i = 1    0 h        147     35
i = 2    1-2 h      629    101
i = 3    3-4 h      222     28
i = 4    5+ h        34      4


The rows need to be given specific numeric values for computational purposes, and it is convenient
to make these just 1, 2, 3, 4, because consecutive integers correspond to the assumption of a
common odds ratio from row to row. (The columns may need to be labeled as 0 and 1 for input to
software.) The logistic regression results from R are shown in Figure 13.5, where the estimated
coefficient $\hat{\beta}_1$ is given as $-.2907$, for an odds ratio of $e^{-.2907} \approx .75$. This means that, for each increase
in TV watching category, the odds of being fit decline to about 3/4 of the previous value.


Coefficients: 

Estimate Std. Error z value Pr(>|z|) 
(Intercept) -1.2132 0.2675 -4.535 5.75e-06 *** 
TV -0.2907 0.1256 -2.315 0.0206 * 


Figure 13.5 Logistic regression output (from R) for TV versus fitness 
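
For readers curious about where such output comes from, the following Python sketch fits the same two-parameter logistic model to the grouped counts of Table 13.9 by Newton-Raphson iteration on the binomial log-likelihood. It is an illustration under stated assumptions, not a description of how R computes its results: the starting values, the fixed number of iterations, and all names are choices made here.

```python
import math

# Table 13.9: TV-time category coded 1-4, with counts of fit and unfit subjects.
tv_level = [1, 2, 3, 4]
fit = [35, 101, 28, 4]
unfit = [147, 629, 222, 34]
totals = [f + u for f, u in zip(fit, unfit)]


def score_and_information(b0, b1):
    """Gradient of the log-likelihood and entries of the information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y, n in zip(tv_level, fit, totals):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        resid = y - n * p
        weight = n * p * (1.0 - p)
        g0 += resid
        g1 += resid * x
        h00 += weight
        h01 += weight * x
        h11 += weight * x * x
    return g0, g1, h00, h01, h11


b0 = b1 = 0.0
for _ in range(25):  # a generous fixed number of Newton-Raphson steps
    g0, g1, h00, h01, h11 = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Standard errors from the inverse of the information matrix at the estimates.
_, _, h00, h01, h11 = score_and_information(b0, b1)
det = h00 * h11 - h01 * h01
se_b0 = math.sqrt(h11 / det)
se_b1 = math.sqrt(h00 / det)

z = b1 / se_b1
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
odds_ratio = math.exp(b1)
print(round(b0, 4), round(b1, 4), round(se_b1, 4), round(z, 3))
```

The estimates, standard errors, z value, and P-value obtained this way should agree with Figure 13.5 to the displayed precision.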


The P-value of .0206 associated with the z test of $H_0$: $\beta_1 = 0$ indicates that we should reject $H_0$ at
the .05 level and can conclude that there is a relationship between TV watching and fitness. Of course,
the existence of a relationship does not imply anything about one causing the other, because this was
an observational study and not a randomized comparative experiment.

A chi-squared test of the same data, which treats both variables as unordered and does not exploit
the ordinal nature of the TV viewing variable, yields $\chi^2 = 6.161$ with 3 df, P-value = .104. So with
this test we would not conclude that there is a relationship, even at the 10% level. There is an
advantage in using logistic regression for this kind of data. ■
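
The chi-squared comparison quoted in Example 13.12 can be checked with a short Python sketch applied to Table 13.9. The function names are illustrative, and the tail-area formula is the standard closed form for a chi-squared distribution with 3 df, written in terms of the complementary error function.

```python
import math

# Table 13.9 counts (rows: TV-time categories; columns: unfit, fit).
observed = [
    [147, 35],
    [629, 101],
    [222, 28],
    [34, 4],
]


def chi_squared_statistic(table):
    """Return (chi-squared value, df) using (row total)(col total)/n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail_3df(x):
    """Upper-tail area under the chi-squared curve with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat, df = chi_squared_statistic(observed)
p_value = chi2_upper_tail_3df(stat)
print(round(stat, 3), df, round(p_value, 3))
```

The P-value exceeds .10, whereas the logistic regression z test, which uses the ordering of the rows, gave .0206.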


The analysis of two ordinal variables, each with more than two levels, can also be handled with 
logistic regression, but it requires a procedure called ordinal logistic regression that allows an ordinal 
response variable. When one factor is ordinal and the other is not, the analysis can be done with 
multinomial (also called nominal or polytomous) logistic regression, which allows a nonordinal 
response variable. 

Models and methods for analyzing data in which each individual is categorized with respect to 
three or more factors (multidimensional contingency tables) are discussed in several of the references 
in the bibliography. 
Exercises: Section 13.2 (19-31)


19. Reconsider the tax holiday data of Exercise
60(b) in Chapter 10. Use a $\chi^2$ statistic to
test the hypothesis of equal population
proportions. The $\chi^2$ statistic should be the
square of the z statistic in that exercise.
How are the P-values related?


20. Should illegal downloading of intellectual 
property (music, images, etc.) be punished? 
This question was asked of 501 teenagers in 
a study published by KRC Research (Jan- 
uary, 2008). The teenagers were also asked 
whether they were familiar with the laws 
against illegal downloading. 


Familiar 
with the law? 


Yes No 
Should illegal downloads Yes 209 140 
be punished? No 46 106 


Are familiarity with the law and attitude 
toward illegal downloading independent 
factors within the teenage population? Test 
at the 5% significance level. If these factors 
are not independent, describe the nature of 
the association. 


21. Brushing your teeth helps prevent cavities, 
doesn’t it? Consider the following data 
from a survey and subsequent dental exam 
of Italian 12-year-olds (“Influence of 
Occlusal Disorders, Food Intake and Oral 
Hygiene Habits on Dental Caries on Ado- 
lescents,” Dentistry 2016). 


Brushing Freq. Cavities No cavities 
Never 11 7 
Once a day 24 21 
2 times a day 99 77 
3 times a day 107 1iy 
4 times a day 42 30 


a. Test whether brushing frequency and
the presence/absence of cavities are
independent in the population of Italian
12-year-olds at the .05 significance
level.

b. Discuss the results of part (a): what are
some possible explanations for this
potentially surprising finding?


22. The authors of the article cited in the previ- 
ous exercise also considered the relationship 
between children’s dental health and their 
parents’ education level. In the accompa- 
nying table, fathers’ education levels are 
translated from the Italian educational sys- 
tem into the rough US equivalent. 


Father’s Education Cavities No cavities 
<High school 51 22 
High school 88 56 
Some college 76 ao 
Higher ed. degree 40 40 


Does the data indicate that parental education
is related to the prevalence of cavities
in children? State the appropriate null and
alternative hypotheses, compute the value
of $\chi^2$, and obtain information about the
P-value. How would you then answer the
question posed?


23. Do vacation habits vary by sex? The 2006
Expedia Vacation Deprivation Survey
interviewed 968 Canadian adults (Ipsos
Insight May 18, 2006). The accompanying
table shows each person cross-classified by
sex and the number of vacation days the
person “usually [takes] each year.”


Number of vacation days 
Sex None 1-5 6-10 11-15 16-20 20-25 >25 


Female 42 25 79 94 70 58 79 
Male 51 21 67 111 71 82 118 


Is there evidence at the $\alpha = .05$ significance
level to conclude that the distribution of the
number of vacation days taken is different
for the two sexes?


24. How universal is the notion of “green light
good, red light bad”? The article “Effects of
Personal Experiences on the Interpretation
of the Meaning of Colours Used in the
Displays and Controls in Electric Control
Panels” (Ergonomics 2015: 1974-1982)
reports the results of a survey of 144 people
with occupations related to electrical
equipment and 206 people in unrelated
fields. Each person was asked to identify
the correct meaning of colored panel lights;
the accompanying data shows answers for
the color red.


Red Light Meaning? 


Emergency Normal Other/ 
Occupation situation situation unknown 
Elec. Equip. 86 40 18 
Other 185 5 16 


Does the data indicate a difference in how
those with electrical equipment experience
and those without understand the
meaning of a red panel light? Test at the .01
significance level. Discuss your findings.


25. The article “Student-Faculty Interaction in
Research Universities” (Res. High. Educ.
2009: 437-459) reported that 20.4% of
3168 students from lower-class families
said they frequently talked with faculty
outside class about course material. The
corresponding percentages for the 16,774
middle-class students and 8188 upper-class
students were 18.6% and 20.2%, respectively.
Does this data suggest that social
class of a student is independent of whether
or not he/she frequently talked with faculty
outside class about course material?

a. Carry out an appropriate test of
hypotheses. [Hint: Think about how
to lay out the data as a two-way table
first.]

b. In light of the sample sizes used in this
study, why is the result in (a) not surprising?


26. Show that the chi-squared statistic for the
test of independence can be written in the
form

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{N_{ij}^2}{\hat{E}_{ij}} - n$$

Why is this formula more efficient computationally
than the defining formula for $\chi^2$?


27. Suppose a random sample of students were
categorized with respect to political views
(liberal, moderate, conservative), marijuana
usage (never, rarely, frequently), and religious
affiliation (Christian, Jewish, Muslim,
and other). The data could be displayed in
four different two-way tables, one corresponding
to each category of the third
factor. With $p_{ijk}$ = P(political category $i$,
marijuana category $j$, and religious category
$k$), the null hypothesis of independence
of all three factors states that
$p_{ijk} = p_{i\cdot\cdot}\, p_{\cdot j\cdot}\, p_{\cdot\cdot k}$. Let $n_{ijk}$ denote the observed
frequency in cell $(i, j, k)$. Show how to
estimate the expected cell counts assuming
that $H_0$ is true ($\hat{e}_{ijk} = n\hat{p}_{ijk}$, so the $\hat{p}_{ijk}$'s
must be determined). Then use the general
df rule to determine the number of degrees
of freedom for the chi-squared statistic.


28. Suppose that in a particular state consisting
of four distinct regions, a random sample of
$n_k$ voters is obtained from the $k$th region for
$k = 1, 2, 3, 4$. Each voter is then classified
according to which candidate (1, 2, or 3)
he/she prefers and according to voter registration
(1 = Dem., 2 = Rep., 3 = Other).
Let $p_{ijk}$ denote the proportion of voters in
region $k$ who belong in candidate category
$i$ and registration category $j$. The null
hypothesis of homogeneous regions is
$H_0$: $p_{ij1} = p_{ij2} = p_{ij3} = p_{ij4}$ for all $i$, $j$ (i.e.,
the proportion within each candidate/
registration combination is the same for
all four regions). Assuming that $H_0$ is true,
determine $\hat{p}_{ijk}$ and $\hat{e}_{ijk}$ as functions of the
observed $n_{ijk}$'s, and use the general df rule
to obtain the number of degrees of freedom
for the chi-squared test.


29. Consider the accompanying 2 x 3 table
displaying the sample proportions that fell
in various combinations of categories (e.g.,
13% of those in the sample were in the first
category of both factors).


        1     2     3
1     .13   .19   .28
2     .07   .11   .22


a. Suppose the sample consisted of n = 100
people. Perform the chi-squared test for
independence with significance level .10.

b. Repeat part (a) assuming that the sample
size was n = 1000.

c. What is the smallest sample size n for
which these observed proportions would
result in rejection of the independence
hypothesis at the .10 level?


30. Use logistic regression to test the relationship
between cavities and father’s education
in Exercise 22. Compare the P-value
with what was found in Exercise 22.
(Remember that $\chi^2 = z^2$.) Explain why you
expected the logistic regression to give a
smaller P-value.


31. A random sample of 100 faculty at a university
gives the results shown below for
professorial rank versus sex. 


Rank Male Female 
Professor 25 9 
Assoc Prof 20 8 
Asst Prof 18 20 


a. Test for a relationship at the 5% level 
using a chi-squared statistic. 

b. Test for a relationship at the 5% level 
using logistic regression. 

c. Compare the P-values in parts (a) and 
(b). Is this in accord with your expec- 
tations? Explain. 

d. Interpret your results. Assuming that 
today’s assistant professors are tomor- 
row’s associate professors and profes- 
sors, do you see implications for the 
future? 
Supplementary Exercises: (32-43)


32. The report “Majoring in Money: How
American College Students Manage Their
Finances” (Sallie Mae 2016) includes the
following data on whether students in different
age groups have at least one credit
card. Data was based on a survey of randomly
selected US college students.


Age      n     Credit card(s)?
18-20    348   43%
21-22    258   63%
23-24    187   1%


Does the data provide convincing evidence
that, among US college students, credit card
ownership rate varies by age group? Test at
the $\alpha = .01$ significance level. [Hint: Think
about how to lay out a contingency table.]


33. The report cited in the previous exercise also
asked students with credit cards how much
they pay off each month.


                   Male   Female
Full balance        146     131
Minimum payment      17      21
Other                52      78


Perform a $\chi^2$ test, and report your results at
the .05 significance level. Be clear about what
hypotheses you’re testing!


34. The article “Psychiatric and Alcoholic
Admissions Do Not Occur Disproportionately 
Close to Patients’ Birthdays” (Psych. Rep. 
1992: 944-946) focuses on the existence of 
any relationship between date of patient 
admission for treatment of alcoholism and 
patient’s birthday. Assuming a 365-day year 
(i.e., excluding leap year), in the absence of 
any relation, a patient’s admission date is 
equally likely to be any one of the 365 pos- 
sible days. The investigators established four 
different admission categories: (1) within 
7 days of birthday, (2) between 8 and 30 days,
inclusive, from the birthday, (3) between 31 
and 90 days, inclusive, from the birthday, and 
(4) more than 90 days from the birthday. 
A sample of 200 patients gave observed fre- 
quencies of 11, 24, 69, and 96 for categories 1, 
2, 3, and 4, respectively. State and test the 
relevant hypotheses using a significance level 
of .01. 

35. A Gallup survey (August 9, 2019) asked 
adults who consume alcoholic beverages for 
their favorite type. The following table shows 
responses separated by the region where each 
respondent lives: 


Liquor Wine Beer 
East 63 84 105 
Midwest 86 73 156 
South 197 174 186 
West 106 120 131 


Does the data suggest that adult beverage 
preferences vary by region? Test at the .05 
significance level. Discuss your findings. 


36. Qualifications of male and female head and 
assistant college athletic coaches were com- 
pared in the article “Sex Bias and the Validity 
of Believed Differences Between Male and 
Female Interscholastic Athletic Coaches” 
(Res. Q. Exerc. Sport 1990: 259-267). Each 
person in random samples of 2225 male 
coaches and 1141 female coaches was clas- 
sified according to number of years of 
coaching experience to obtain the accompa- 
nying two-way table. Is there enough evi- 
dence to conclude that the proportions falling 
into the experience categories are different for 
men and women? Use $\alpha = .01$.


Years of Experience 


Sex 1-3 4-6 7-9 10-12 13+ 
Male 202 369 482 361 811 
Female 230 251 238 164 258 


37. The authors of the article “Predicting Pro- 
fessional Sports Game Outcomes from 
Intermediate Game Scores” (Chance 1992: 
18-22) used a chi-squared test to determine 


whether there was any merit to the idea that
basketball games are not settled until the last
quarter, whereas baseball games are over by
the seventh inning. They also considered
football and hockey. Data was collected for
189 basketball games, 92 baseball games, 80
hockey games, and 93 football games. The
games analyzed were sampled randomly
from all games played during the 1990 season
for baseball and football and for the
1990-1991 season for basketball and hockey.
For each game, the late-game leader was
determined, and then it was noted whether
the late-game leader actually ended up winning
the game. The resulting data is summarized
in the accompanying table.


Sport        Late-Game Leader Wins   Late-Game Leader Loses
Basketball            150                      39
Baseball               86                       6
Hockey                 65                      15
Football               72                      21


The authors state, “Late-game leader is
defined as the team that is ahead after three
quarters in basketball and football, two periods
in hockey, and seven innings in baseball.
The chi-square value on three degrees of
freedom is 10.52 (P < .015).”

a. State the relevant hypotheses and reach a
conclusion using $\alpha = .05$.

b. Do you think that your conclusion in part
(a) can be attributed to a single sport being
an anomaly?


38. A study in the Journal of Marketing
Research investigated the relationship
between facility conditions at gas stations
and aggressiveness in the pricing of gasoline.
The accompanying data is based on a random
sample of n = 441 stations.


Observed pricing policy 


Aggressive Neutral Nonaggressive 


Condition Substandard 24 15 17 
Standard 52 73 80 
Modern 58 86 36 


a. Does the data suggest that an association
exists between these two variables? Test
at the $\alpha = .01$ level.

b. If a statistically significant association
exists, describe that association carefully
and in context.


39. The Associated Press (Dec. 7, 2005)
reported on an international survey about
the treatment of terrorist suspects. Random
samples of 1000 adults from each of several
nations were asked, “Do you feel the use of
torture against suspected terrorists to obtain
information about terrorists activities is
justified?” Data consistent with the article
appears in the accompanying table.


                 Okay to torture terror suspects?

Country        Never   Rarely   Sometimes   Often   Not sure
Italy           600     140        140        90       30
France          400     250        200       120       30
South Korea     100     330        470        60       40
Spain           540     160        140        70       90
USA             360     230        270       110       30


Does the data suggest that attitudes toward
the treatment of terrorist suspects differed
between these five nations in 2005? State
and test the relevant hypotheses at the
$\alpha = .01$ level. Comment on any specific
trends.


40. The likelihood ratio test of Chapter 9 provides
an alternative to Pearson’s chi-squared
statistic. Let $L(p_1, \ldots, p_k) = C \cdot p_1^{n_1} \cdots p_k^{n_k}$
denote the multinomial likelihood
function ($C$ will be irrelevant in what
follows). The likelihood ratio test statistic is

$$\Lambda = \frac{L(p_{10}, \ldots, p_{k0})}{L(\hat{p}_1, \ldots, \hat{p}_k)}$$

where $\hat{p}_i = n_i/n$, the sample proportion of
observations in the $i$th category (the mle for
$p_i$). The key result required for the test is
that for large $n$, $-2\ln(\Lambda)$ has approximately
a $\chi^2_{k-1}$ distribution.
a. Using the information provided, simplify
the test statistic $-2\ln(\Lambda)$ as much
as possible.

b. If a roulette wheel is working properly, 
spins should land on the colors black, 
red, and green in proportions 18/38, 
18/38, and 2/38, respectively. Suppose 
that 190 spins resulted in 96 black, 76 
red, and 18 green. Use the likelihood 
ratio test to determine whether the 
sample data is compatible with the the- 
oretical probabilities. 

c. Use Pearson’s chi-squared goodness-of- 
fit test for the data in part (b). How do 
the results of the two tests compare? 


41. The NCAA basketball tournament begins with 64 teams that are apportioned
into four regional tournaments, each involving 16 teams. The 16 teams in
each region are then ranked (seeded) from 1 to 16. During the 12-year period
from 1991 to 2002, the top-ranked team won its regional tournament 22 times,
the second-ranked team won 10 times, the third-ranked team won 5 times, and
the remaining 11 regional tournaments were won by teams ranked lower than 3.
Let P_ij denote the probability that the team ranked i in its region is
victorious in its game against the team ranked j. Once the P_ij’s are
available, it is possible to compute the probability that any particular
seed wins its regional tournament (a complicated calculation because the
number of outcomes in the sample space is quite large). The paper
“Probability Models for the NCAA Regional Basketball Tournaments” (Amer.
Statist. 1991: 35–38) proposed several different models for the P_ij’s.

a. One model postulated P_ij = .5 − λ(i − j) with λ = 1/32 (from which
   P_16,1 = λ, P_16,2 = 2λ, etc.). Based on this, P(seed #1 wins) = .27477,
   P(seed #2 wins) = .20834, and P(seed #3 wins) = .15429. Does this model
   appear to provide a good fit to the data?

b. A more sophisticated model has P_ij = .5 + .2813625(z_i − z_j), where the
   z’s are measures of relative strengths related to standard normal
   percentiles [percentiles for successive highly seeded teams are closer
   together than is the case for teams seeded lower, and .2813625 ensures
   that the range of probabilities is the same as for the model in part
   (a)]. The resulting probabilities of seeds 1, 2, or 3 winning their
   regional tournaments are .45883, .18813, and .11032, respectively. Assess
   the fit of this model.


42. Have you ever wondered whether soccer players suffer adverse effects
from hitting “headers”? The authors of the article “No Evidence of Impaired
Neurocognitive Performance in Collegiate Soccer Players” (Amer. J. Sports
Med. 2002: 157–162) investigated this issue from several perspectives.

a. The paper reported that 45 of the 91 
soccer players in their sample had suf- 
fered at least one concussion, 28 of 96 
nonsoccer athletes had suffered at least 
one concussion, and only 8 of 53 stu- 
dent controls had suffered at least one 
concussion. Analyze this data and draw 
appropriate conclusions. 

b. For the soccer players, the sample cor- 
relation coefficient calculated from the 
values of x = soccer exposure (total 
number of competitive seasons played 
prior to enrollment in the study) and 
y = score on an immediate memory
recall test was r = −.220. Interpret this
result. 

c. Here is summary information on scores 
on a controlled oral word association test 
for the soccer and nonsoccer athletes: 


n_1 = 26, x̄_1 = 37.50, s_1 = 9.13,
n_2 = 56, x̄_2 = 39.63, s_2 = 10.19

Analyze this data and draw appropriate
conclusions.

d. Considering the number of prior 
nonsoccer concussions, the values of 
mean ± SD for the three groups were
soccer players, .30 ± .67; nonsoccer
athletes, .49 ± .87; and student controls,
.19 ± .48. Analyze this data and
draw appropriate conclusions. 


43. Do the successive digits in the decimal expansion of π behave as though
they were selected from a random number table (or came from a computer’s
random number generator)?

a. Let p_0 denote the long-run proportion of digits in the expansion that
   equal 0, and define p_1, ..., p_9 analogously. What hypotheses about
   these proportions should be tested, and what is df for the chi-squared
   test?

b. H0 of part (a) would not be rejected for the nonrandom sequence
   012 ... 901 ... 901 .... Consider nonoverlapping groups of two digits,
   and let p_ij denote the long-run proportion of groups for which the first
   digit is i and the second digit is j. What hypotheses about these
   proportions should be tested, and what is df for the chi-squared test?

c. Consider nonoverlapping groups of 5 digits. Could a chi-squared test of
   appropriate hypotheses about the p_ijklm’s be based on the first 100,000
   digits? Explain.

d. The paper “Are the Digits of π an Independent and Identically Distributed
   Sequence?” (Amer. Statist. 2000: 12–16) considered the first 1,254,540
   digits of π, and reported the following P-values for group sizes of
   1, ..., 5 digits: .572, .078, .529, .691, .298. What would you conclude?


14 Nonparametric Methods


Introduction
In this chapter we consider some inferential methods that are different in important ways from those 
considered earlier. Recall that many of the confidence intervals and test procedures developed in 
Chapters 8, 9, 10, 11 and 12 were based on some sort of a normality assumption. As long as such an 
assumption is at least approximately satisfied, the actual confidence and significance levels will be at 
least approximately equal to the “nominal” levels, those prescribed by the experimenter through the 
choice of particular t or F critical values. However, if there is a substantial violation of the normality
assumption, the actual levels may differ considerably from the nominal levels (e.g., the use of t_{.025} in a
confidence interval formula may actually result in a confidence level of only 88% rather than the 
nominal 95%, more than doubling the error rate). Here we develop nonparametric or distribution-free 
procedures that are valid for a wide variety of underlying distributions rather than being tied to 
normality. We have actually already introduced several such methods: the bootstrap intervals and 
permutation tests are valid without restrictive assumptions on the underlying distribution(s). 
Section 14.1 details inference procedures for population quantiles—the population median, 90th 
percentile, and so on—that apply to any continuous distribution. In Section 14.2, we present alternatives
to the one-sample t procedures that do not require population normality (although they do
make some less-restrictive distributional assumptions). The most popular nonparametric methods are
so-called rank-based tests, wherein the original raw data values are replaced by their ranks (1 for the
smallest observation, 2 for the next smallest, etc.). Sections 14.3 and 14.4 describe rank-based alternatives to
two-sample t procedures, one-way ANOVA, and randomized block ANOVA.


14.1 Exact Inference for Population Quantiles 


The inferential methods presented so far in this book, including t tests and the analysis of variance,
have largely focused on one or more population means. However, in some situations other summary 
measures are more relevant. For example, house prices in any particular city or region are famously 
right-skewed: a small number of very large, expensive homes inflates the mean cost. Realtors or 
buyers looking to quantify the “typical” house price in an area might be better served to estimate the 
population median price, rather than the mean price. Or, an internet service provider may plan to 
charge an extra fee to the 5% of customers with the heaviest data usage, meaning that the population 
95th percentile is of interest to the company.

As in Chapter 4, we will use the notation η_p to denote the (100p)th percentile (aka the pth quantile)
of a probability distribution, i.e., the value that separates the lowest (100p)% of values from the rest.
So, for instance, a population upper quartile (75th percentile) will be denoted by η_.75. The population
median, η_.5, will be more frequently denoted by μ̃ as in earlier sections. Here, we first develop a
general confidence interval method for η_p. Then, we present a hypothesis testing procedure for μ̃
which may be applied to the analysis of both one-sample and paired data.


A CI for a Population Quantile

Inferences on population quantiles depend, perhaps not surprisingly, on the quantiles of the sample
data. To that end, let X_1, ..., X_n represent a random sample from some continuous population
distribution of interest. The order statistics Y_1, ..., Y_n as defined in Chapter 5 are


Y_1 = the smallest among X_1, X_2, ..., X_n (i.e., the sample minimum)
Y_2 = the second smallest among X_1, X_2, ..., X_n
  ⋮
Y_n = the largest among X_1, X_2, ..., X_n (the sample maximum)


Because the population distribution is assumed continuous, with probability one there will be no ties
among the X_i’s and, hence, Y_1 < Y_2 < ··· < Y_n. Note, though, that no other assumptions (such as
normality) are made about the population; the methods presented here apply to a broad range of
distributions. Confidence intervals for population quantiles rely on the following proposition.


PROPOSITION   Let X_1, ..., X_n be a random sample from a continuous distribution with pth
quantile η_p (0 < p < 1) and let Y_1, ..., Y_n represent the corresponding order
statistics. Then for any two integers r and s satisfying 1 ≤ r < s ≤ n,

\[ P(Y_r \le \eta_p \le Y_s) = \sum_{k=r}^{s-1} \binom{n}{k} p^k (1-p)^{n-k} \tag{14.1} \]


Proof   For any integer k between 1 and n − 1, η_p will fall between the consecutive order statistics Y_k
and Y_{k+1} if and only if exactly k of the X_i’s are ≤ η_p. Now consider X_i a success if X_i ≤ η_p and a
failure otherwise. Since the X_i’s are independent, the number of successes among them (n independent
trials) is a binomial rv with parameters n and P(success) = P(X_i ≤ η_p) = p. Hence

\[ P(Y_k \le \eta_p \le Y_{k+1}) = P(\text{exactly } k \text{ of the } X_i\text{'s are} \le \eta_p) = \binom{n}{k} p^k (1-p)^{n-k} \]

Therefore, for integers r < s,

\[ \begin{aligned}
P(Y_r \le \eta_p \le Y_s) &= P\big(\{Y_r \le \eta_p \le Y_{r+1}\} \cup \{Y_{r+1} \le \eta_p \le Y_{r+2}\} \cup \cdots \cup \{Y_{s-1} \le \eta_p \le Y_s\}\big) \\
&= P(Y_r \le \eta_p \le Y_{r+1}) + \cdots + P(Y_{s-1} \le \eta_p \le Y_s) = \sum_{k=r}^{s-1} \binom{n}{k} p^k (1-p)^{n-k}
\end{aligned} \]

Suppose now that a confidence interval for η_p is desired. The preceding proposition indicates that if
the interval (y_r, y_s) is used, then the associated confidence level is the binomial probability on the
right-hand side of Expression (14.1). Due to the discrete nature of this probability, it will not be
possible to achieve every confidence level. In practice, r and s are selected so that the confidence level
from (14.1) is as close as possible to, but no lower than, the desired level.


Example 14.1   Let’s determine a 90% confidence interval for the population upper quartile η_.75 based
on a sample of size n = 20 from any continuous population distribution. Logically, the interval should
straddle the sample 75th percentile, which is very roughly the .75(20) = 15th ordered observation. The
required indices r and s can then be determined by trial and error using (14.1). For instance,

\[ P(Y_{12} \le \eta_{.75} \le Y_{18}) = \sum_{k=12}^{18-1} \binom{20}{k} (.75)^k (.25)^{20-k} = .8678 \]

Hence the interval (y_12, y_18) is slightly too “narrow,” in the sense that the associated confidence level
is just shy of 90%. However, a similar calculation shows P(Y_12 ≤ η_.75 ≤ Y_19) = .9348 > .90, and this
is as close as we can get to .90 without going under. So, the suggested CI is (y_12, y_19).

With 93.48% confidence, the parameter η_.75 lies between y_12 and y_19, the 12th-smallest and 19th-
smallest (i.e., second largest) ordered values from a sample of size n = 20. Again, this interval is valid
for any (continuous) population; the X_i’s may come from a normal, Weibull, or any other continuous
distribution. ■
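
The trial-and-error step in Example 14.1 is easy to mechanize. The following is a minimal Python sketch, not part of the original text: the function name `quantile_ci_coverage` is our own, and it does nothing more than evaluate the binomial sum on the right-hand side of (14.1) for a proposed pair of indices r and s.

```python
from math import comb


def quantile_ci_coverage(n, p, r, s):
    """Confidence level of the interval (y_r, y_s) for the pth quantile.

    Evaluates the binomial sum in Expression (14.1): k runs from r to s - 1.
    The name and signature are illustrative, not taken from the text.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, s))


# The two candidate intervals considered in Example 14.1 (n = 20, p = .75)
print(round(quantile_ci_coverage(20, 0.75, 12, 18), 4))  # 0.8678
print(round(quantile_ci_coverage(20, 0.75, 12, 19), 4))  # 0.9348
```

Trying other (r, s) pairs is then a matter of changing the last two arguments.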


Expression (14.1) can be modified slightly to obtain one-sided bounds for the pth quantile. If an
upper confidence bound is desired, delete Y_r from the left-hand side of (14.1) and substitute r = 0 into
the binomial calculation on the right-hand side. Similarly, eliminating Y_s and substituting s − 1 = n in
the binomial calculation results in a lower confidence bound.

Determining the indices r and s to achieve a desired confidence level can clearly be tedious. Notice
that if the desired (two-sided) confidence level is 100(1 − α)%, then on the right-hand side of (14.1)
r and s − 1 are effectively the α/2 and 1 − α/2 quantiles of the Bin(n, p) distribution. Using the normal
approximation to the binomial from Chapter 4 with a continuity correction, and writing μ = np and
σ = √(np(1 − p)) for the binomial mean and standard deviation, r and s can then be approximated by

\[ \begin{aligned}
r - .5 \approx \mu - z_{\alpha/2}\,\sigma \;&\Rightarrow\; r \approx (np + .5) - z_{\alpha/2}\sqrt{np(1-p)} \\
(s - 1) + .5 \approx \mu + z_{\alpha/2}\,\sigma \;&\Rightarrow\; s \approx (np + .5) + z_{\alpha/2}\sqrt{np(1-p)}
\end{aligned} \]

In Example 14.1, even though the normal approximation is of questionable accuracy (n is fairly
small), the preceding expressions give r ≈ 12.3 and s ≈ 18.7, which round to the correct integers
found in the example.
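
As a quick check of the approximation, here is a small Python sketch under the same assumptions (normal approximation with continuity correction). The helper name `approx_indices` is hypothetical; the standard normal quantile comes from `statistics.NormalDist` in the standard library.

```python
from math import sqrt
from statistics import NormalDist


def approx_indices(n, p, conf_level):
    """Approximate (r, s) for a two-sided CI for the pth quantile.

    Implements r ~ (np + .5) - z*sqrt(np(1-p)) and
    s ~ (np + .5) + z*sqrt(np(1-p)), with z the upper alpha/2
    standard normal critical value. Illustrative helper only.
    """
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = n * p + 0.5
    half_width = z * sqrt(n * p * (1 - p))
    return center - half_width, center + half_width


r, s = approx_indices(20, 0.75, 0.90)
print(round(r, 1), round(s, 1))  # 12.3 18.7
```

Rounding to the nearest integers gives candidate indices, whose exact confidence level should still be verified with the binomial sum in (14.1).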


Hypothesis Testing for a Population Median 

A binomial calculation similar to the one presented in Expression (14.1) can also be used to calculate 
the P-value for a hypothesis test concerning a population quantile. Here we focus on the population 
median, because this is the most common quantile of interest, but the ideas can easily be generalized 
to any other percentile. Consider the null hypothesis H0: μ̃ = μ̃0, where μ̃0 is the null value of the
population median. If H0 is true, we expect roughly half of the sample observations to fall below μ̃0
and the other half above it. A test procedure is based on counting how many of the x_i’s exceed μ̃0.

Example 14.2   Example 1.17 presented n = 57 observations on total nitrogen (TN) load (kg/day)
from a particular location in Chesapeake Bay. The data is extremely right-skewed (see Figure 1.17).
The sample median of these 57 observations is x̃ = 92.2. Let’s test the hypothesis that the population
median TN load exceeds 60 kg/day.

With μ̃ = true median TN load in this part of Chesapeake Bay, the null and alternative hypotheses are


H0: μ̃ = 60
Ha: μ̃ > 60

Of the 57 TN values in the sample, 36 exceed 60 kg/day and the other 21 are below this value. If H0
is true, we’d expect 28.5 measurements on either side of 60, so the data appears to somewhat
contradict the null hypothesis.

To compute a P-value, let’s determine the probability that 36 or more of the observations in a
sample of size 57 would exceed 60 kg/day, if that is truly the population median. If H0 is true, the
number of X_i’s that exceed 60 has a binomial distribution with n = 57 and p = .5, since by definition
P(X_i > μ̃) = P(X_i < μ̃) = .5. Reflecting the upper-tailed alternative hypothesis, the P-value is

\[ \text{P-value} = P(36 \text{ or more } X_i\text{'s exceed } 60 \text{ when } \tilde{\mu} = 60)
 = \sum_{k=36}^{57} \binom{57}{k} (.5)^k (.5)^{57-k} = .031 \]

With this low P-value, H0 is rejected (in particular, .031 < .05). At the .05 significance level, we
have evidence that the true median TN load at this location exceeds 60 kg/day.

The preceding test can also be reframed by defining a new parameter. Let p = P(X_i > 60), the
probability that a random TN observation will exceed 60. If H0 is true, then 60 is the population
median and p = .5. However, if Ha is true, then 60 kg/day is less than the actual median μ̃, and so
more than half of the X distribution exceeds 60. That is, Ha is equivalent to the assertion that p > .5.
To test the modified hypotheses


H0: p = .5
Ha: p > .5


we can use either of the one-proportion procedures presented in Section 9.3. The P-value for the
exact binomial test from the end of that section is identical to the calculation above; alternatively,
since n is fairly large, the one-proportion z test would also be appropriate. ■
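
The binomial tail in Example 14.2 can be reproduced in a few lines. This is a sketch under the example's assumptions (no observation exactly equal to the null value); the function name `sign_test_pvalue` and its `alternative` argument are our own naming, not an existing library API.

```python
from math import comb


def sign_test_pvalue(n_positive, n, alternative="greater"):
    """One-sided sign-test P-value based on a Bin(n, .5) null distribution.

    n_positive: number of observations exceeding the null median.
    alternative: "greater" sums the upper tail, "less" the lower tail.
    Illustrative helper; ties with the null value must be removed first.
    """
    if alternative == "greater":
        tail = range(n_positive, n + 1)
    elif alternative == "less":
        tail = range(0, n_positive + 1)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return sum(comb(n, k) for k in tail) / 2**n


# Example 14.2: 36 of the 57 TN observations exceed 60 kg/day
print(round(sign_test_pvalue(36, 57), 3))  # 0.031
```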


The hypothesis test illustrated in Example 14.2 is called a (one-sample) sign test. Why a “sign”
test? One way to think about the binomial count is to look at the quantities (X_1 − μ̃0), ..., (X_n − μ̃0):
each has a positive sign (i.e., is > 0) when X_i > μ̃0 and a negative sign when X_i < μ̃0. In Example
14.2, the data was equivalent to 36 positive signs and 21 negative signs, and the test statistic value
was the number of positive signs.

The one-sample sign test is often applied to paired data. Consider a study with two settings, A and
B (e.g., A = after physical therapy and B = before). If we let X_i and Y_i denote the ith individual’s
response (e.g., range of motion) in settings A and B, respectively, we know from previous discussions
to examine the within-subject differences D_i = X_i − Y_i. A “positive sign,” meaning D_i > 0, indicates
that the ith subject got a higher response under setting A than with setting B. We can test for a
treatment effect in favor of setting A (for example, that physical therapy increases range of motion)
by examining the hypotheses

H0: μ̃_D = 0
Ha: μ̃_D > 0


Equivalently, if we define p = P(X_i > Y_i) = P(D_i > 0), then we may test H0: p = .5 versus Ha: p > .5
as at the end of Example 14.2.


Example 14.3 Do technological improvements “slow down” as a product spends more time on the 
market? The 2017 MIT paper “Exploring the Relationship Between Technological Improvement and 
Innovation Diffusion: An Empirical Test” attempts to answer this question by quantifying the 
improvement rate (%/year) of 18 items, from washing machines to laptops, during both the early stage 
and late stage of each item’s market presence. The data is summarized below. 


Early stage   Late stage   Difference      Early stage   Late stage   Difference
   12.96         12.65        −0.31           15.49         18.80        +3.31
    5.50          3.52        −1.98           34.30         25.73        −8.57
    5.50          5.00        −0.50           32.15         32.80        +0.65
    3.90          3.10        −0.80           16.62         15.84        −0.78
    3.52          2.93        −0.59           32.37         36.33        +3.96
    3.90          3.10        −0.80           24.25         27.15        +2.90
   10.53         14.41        +3.88            3.87          3.92        +0.05
   38.18         31.79        −6.39          180.84         84.51       −96.33
   31.79         36.33        +4.54           26.80         47.52       +20.72


With D = difference in improvement rate (late stage minus early stage), the authors tested the
hypotheses H0: μ̃_D = 0 versus Ha: μ̃_D < 0; the alternative hypothesis aligns with the prevailing theory
of a late-stage slowdown. Eight of the 18 differences are positive, and the lower-tailed P-value is the
chance of observing eight or fewer positive differences if the true median difference is zero (so
positive and negative are each equally likely):


P-value = P(K ≤ 8 when K ~ Bin(18, .5)) = B(8; 18, .50) = .407


The very large P-value, consistent with the close split (8 vs. 10) between positive and negative
differences, suggests that H0 should not be rejected. The data does not lend credence to the slowdown
theory of technological improvement. ■
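
For paired data, the only extra work is counting signs. The sketch below applies the sign test to the 18 differences tabulated in Example 14.3; the variable names are ours, and the calculation assumes no difference is exactly zero. Evaluated exactly, the lower-tail sum is 106762/262144, or about .407.

```python
from math import comb

# Late-stage minus early-stage improvement rates from Example 14.3
differences = [
    -0.31, -1.98, -0.50, -0.80, -0.59, -0.80, 3.88, -6.39, 4.54,
    3.31, -8.57, 0.65, -0.78, 3.96, 2.90, 0.05, -96.33, 20.72,
]

n = len(differences)
n_positive = sum(d > 0 for d in differences)  # number of "+" signs

# Lower-tailed P-value: P(K <= n_positive) when K ~ Bin(n, .5)
p_value = sum(comb(n, k) for k in range(n_positive + 1)) / 2**n

print(n, n_positive, round(p_value, 3))
```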


Exercises: Section 14.1 (1-10) 


1. Example 1.4 presented data on the starting salaries of n = 38 civil
   engineering graduates. Use this data to construct a 95% confidence
   interval for the population lower quartile (25th percentile) of civil
   engineering starting salaries.


2. The following Houston, TX house prices ($1000’s) were extracted from
   zillow.com in August 2019:

   162 165 167 188 189 194 200 233 236 247
   248 257 258 286 290 307 330 345 377 389
   459 460 513 569 1399

   Treating these 25 houses as a random sample of all available homes in
   Houston, calculate a 90% upper confidence bound for the true median home
   price in that city.


3. Based on a random sample of 40 observations from any continuous
   population, construct a confidence interval formula for the population
   median that has confidence level (at least) 95%.


4. Let η_.3 denote the 30th percentile of a population. Find the smallest
   sample size n for which P(Y_1 ≤ η_.3 ≤ Y_n) is at least .99. In other
   words, determine the smallest sample size for which the span of the
   sample, min to max, is a 99% CI for η_.3.


5. Refer back to Exercise 2. The median home value in the state of Texas in
   August 2019 was $197,000. Use the data in Exercise 2 to test the
   hypothesis that the true median home price in Houston exceeds this value,
   at the .05 significance level.


6. The following data on grip strength (N) for 42 individuals was read from
   a graph in the article “Investigation of Grip Force, Normal Force,
   Contact Area, Hand Size, and Handle Size for Cylindrical Handles” (Human
   Factors 2008: 734–744):

   16 18 20 26 33 41 54 56 66
   68 87 91 95 98 106 109 111 118
   127 131 135 145 147 149 151 168 172
   183 189 190 200 210 220 229 230
   233 238 244 259 294 329 403

   a. What does the data suggest about the population distribution of grip
      strength? Why might the median be a more appropriate measure of
      “typical” grip strength than the mean?

   b. Test the hypothesis that the population median grip strength is less
      than 170 N at the .05 significance level.


7. Child development specialists widely believe that diverse recreational
   activities can improve the social and emotional conduct of children. The
   article “Influence of Physical Activity on the Social and Emotional
   Behavior of Children Aged 2-5 Years” (Cuban J. of Gen. Integr. Med. 2016)
   reported a study of 25 young children diagnosed with social and emotional
   behavior problems. The children participated in a physical activity
   regimen for one year, and each child was measured for negative social
   behavior indicators (tantrums, crying, etc.) both before and after the
   regimen. Lower scores indicate improvement; the children’s changes in
   score (post minus pre) are summarized below.

   Score change < 0    Score change = 0    Score change > 0
          17                   1                   7

   Use the sign test to determine whether the data indicates a statistically
   significant improvement in scores at the α = .05 level. [Hint: Delete the
   one 0 observation and work with the other 24 differences; this is a
   common way to address “ties” in pre- and post-intervention scores.]


8. Consider the following scene: an actor watches a man in a gorilla suit
   hide behind haystack A. The actor leaves the area, and the “gorilla”
   moves to behind haystack B, after which the actor re-enters. A video of
   this scene was shown to 22 (real) apes, and eye-tracking software was
   used to track which haystack they stared at more after the actor
   re-entered. “False belief” theory says that the viewer will look at
   haystack A more, matching the actor’s mistaken belief that the gorilla is
   still there, despite the viewer knowing better. In the study, 17 of the
   apes spent more time looking at haystack A (“Great Apes Anticipate That
   Other Individuals Will Act According to False Beliefs,” Science,
   7 October 2016).

   a. Use a sign test to determine whether the true median looking-time
      difference (A − B) supports the false belief theory in primates.

   b. Since the data is paired (time looking at each of two haystacks for
      the 22 apes), the paired t procedure of Chapter 10 might also be
      applicable. What information would that test require, and what
      assumptions must be met?


9. The article “Hitting a High Note on Math Tests: Remembered Success
   Influences Test Preferences” (J. of Exptl. Psych. 2016: 17–38) reported a
   study in which 130 participants were administered two math tests (in
   random order): a shorter, more difficult exam and a longer, easier one.
   Participants were then asked to estimate how much time they had spent on
   each exam. Let μ̃_D denote the true median difference in time estimates
   (short test minus long test). Test the hypotheses H0: μ̃_D = 0 versus
   Ha: μ̃_D > 0, with Ha supporting the psychological contention that people
   perceive easier tasks to be quicker, using a sign test based on the fact
   that 109 participants gave positive differences and 21 gave negative
   differences. (In fact, test-takers required 3.2 min longer, on average,
   to complete the lengthier exam.)


10. Consider the following data on resting energy expenditure (REE, in
    calories per day) for eight subjects both while on an intermittent
    fasting regimen and while on a standard diet (“Intermittent Fasting Does
    Not Affect Whole-Body Glucose, Lipid, or Protein Metabolism,” Amer. J.
    of Clinical Nutr. 2009: 1244–1251):

                                Subject
    Diet          1         2         3         4         5
    I.F.       1753.7    1604.4    1576.5    1279.7    1754.2
    Standard   1755.0    1691.1    1697.1    1477.7    1785.2

                                Subject
    Diet          6         7         8
    I.F.       1695.5    1700.1    1717.0
    Standard   1669.7    1901.3    1735.3

    Let μ̃_D = true median difference in REE (IF minus standard diet). Test
    the hypotheses H0: μ̃_D = 0 versus Ha: μ̃_D < 0 at the .05 significance
    level using the sign test.

14.2 One-Sample Rank-Based Inference 


The previous section introduced the sign test for assessing the plausibility of H0: μ̃ = μ̃0, where μ̃
denotes a population median. The basis of the test was to consider the quantities X_1 − μ̃0, ..., X_n − μ̃0
and count how many of those differences are positive. Thus the original sample is reduced to a
collection of n “signs” (+ or −). If H0 is true, there should be roughly equal numbers of +’s and −’s,
and the test statistic measures the degree of discrepancy from that 50-50 balance. The sign test is
applicable to any continuous population distribution.

Here, we consider a test procedure that is more powerful than the sign test but requires an additional 
distributional assumption. Suppose a research chemist replicated a particular experiment a total of 10 
times and obtained the following values of reaction temperature (°C), ordered from smallest to largest: 


−.76   −.19   −.05   .57   1.30   2.02   2.17   2.46   2.68   3.02


The distribution of reaction temperature is of course continuous. Suppose the investigator is
willing to assume that this distribution is symmetric, in which case the two halves of the distribution
on either side of μ̃ are mirror images of each other. (Provided that the mean μ exists for this
symmetric distribution, μ̃ = μ and they are both the point of symmetry.) The assumption of symmetry
may at first seem quite bold, but remember that we have frequently assumed a normal distribution for
inference procedures. Since a normal distribution is symmetric, the assumption of symmetry without
any additional distributional specification is actually a weaker assumption than normality.

Let’s now consider testing the specific null hypothesis that μ̃ = 0. If this hypothesis is true, symmetry
implies that a temperature of any particular magnitude, say 1.50, is no more likely to be positive
(+1.50) than to be negative (−1.50). A glance at the data above casts doubt on this hypothesis; for
example, the sample median is 1.66, which is far larger in magnitude than any of the three negative
observations.

Figure 14.1 shows graphs of two symmetric pdfs, one for which H0 is true and the other for which
the median of the distribution considerably exceeds 0. In the first case we expect the magnitudes of
the negative observations in the sample to be comparable to those of the positive sample observations.
However, in the second case observations of large absolute magnitude will tend to be positive rather
than negative.

Figure 14.1 Distributions for which (a) μ̃ = 0; (b) μ̃ >> 0


A Rank-Based Test Statistic 

For the sample of ten reaction temperatures, let’s for the moment disregard the signs of the
observations and rank their magnitudes (i.e., absolute values) from 1 to 10, with the smallest getting rank 1,
the second smallest rank 2, and so on. Then apply the original sign of each observation to the
corresponding rank, so some signed ranks will be negative (e.g., −3) whereas others will be positive
(e.g., +8).


Absolute value   .05   .19   .57   .76   1.30   2.02   2.17   2.46   2.68   3.02
Rank               1     2     3     4      5      6      7      8      9     10
Signed rank       −1    −2     3    −4      5      6      7      8      9     10


The test statistic for the procedure developed in this section will be S+ = the sum of the positively
signed ranks. For the given data, the observed value of S+ is


s+ = sum of the positive ranks = 3 + 5 + 6 + 7 + 8 + 9 + 10 = 48
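
The ranking procedure just described is mechanical enough to script. Below is a minimal Python sketch; `signed_rank_statistic` is a name we introduce for illustration, and the code assumes, as the text does for continuous data, that there are no tied magnitudes and no observation exactly equal to the null value.

```python
def signed_rank_statistic(xs, null_value=0.0):
    """Return s+, the sum of the ranks attached to positive deviations.

    Deviations x - null_value are ranked by absolute value (rank 1 for the
    smallest magnitude); ranks of positive deviations are then summed.
    Assumes no ties among the magnitudes and no zero deviations.
    """
    deviations = [x - null_value for x in xs]
    # Indices of the deviations, ordered from smallest to largest magnitude
    order = sorted(range(len(deviations)), key=lambda i: abs(deviations[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if deviations[i] > 0)


# The ten reaction temperatures from the text
temps = [-0.76, -0.19, -0.05, 0.57, 1.30, 2.02, 2.17, 2.46, 2.68, 3.02]
print(signed_rank_statistic(temps))  # 48
```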


When the median of the distribution is much greater than 0, most of the observations with large
absolute magnitudes should be positive, resulting in positively signed ranks and a large value of s+.
On the other hand, if the median is 0, magnitudes of positively signed observations should be
intermingled with those of negatively signed observations, in which case s+ will not be very large.
(As noted before, this characterization depends on the underlying distribution being symmetric.) Thus
we should reject H0: μ̃ = 0 in favor of Ha: μ̃ > 0 when s+ is “quite large”; that is, the rejection region
should have the form s+ ≥ c.

The critical value c should be chosen so that the test has a desired significance level (type I error
probability), such as .05 or .01. This necessitates finding the distribution of the test statistic S+ when
the null hypothesis is true. Let’s consider n = 5, in which case there are 2^5 = 32 ways of applying
signs to the five ranks 1, 2, 3, 4, and 5 (each rank could have a − sign or a + sign). The key is that
when H0 is true, any collection of five signed ranks has the same chance as does any other collection.
That is, the smallest observation in absolute magnitude is equally likely to be positive or negative, the
same is true of the second smallest observation in absolute magnitude, and so on. Thus the collection
−1, 2, 3, −4, 5 of signed ranks is just as likely as the collection 1, 2, 3, 4, −5, and just as likely as any
one of the other 30 possibilities.

Table 14.1 lists the 32 possible signed-rank sequences when n = 5 along with the value s+ for
each sequence. This immediately gives the “null distribution” of S+. For example, Table 14.1 shows
that three of the 32 possible sequences have s+ = 8, so P(S+ = 8 when H0 is true) = 3/32. This null
distribution appears in Table 14.2 and Figure 14.2. Notice that it is symmetric about 7.5; more
generally, S+ is symmetrically distributed over the possible values 0, 1, 2, ..., n(n + 1)/2 when
H0 is true. This symmetry will be important in relating the rejection region of lower-tailed and
two-tailed tests to that of an upper-tailed test.

Table 14.1 Possible signed-rank sequences for n = 5

Sequence                 s+      Sequence                 s+
−1  −2  −3  −4  −5        0      −1  +2  −3  −4  +5        7
+1  −2  −3  −4  −5        1      −1  −2  +3  −4  +5        8
−1  +2  −3  −4  −5        2      +1  +2  −3  −4  +5        8
−1  −2  +3  −4  −5        3      +1  −2  +3  −4  +5        9
+1  +2  −3  −4  −5        3      −1  +2  +3  −4  +5       10
+1  −2  +3  −4  −5        4      +1  +2  +3  −4  +5       11
−1  −2  −3  +4  −5        4      −1  +2  +3  +4  −5        9
+1  −2  −3  +4  −5        5      +1  +2  +3  +4  −5       10
−1  +2  −3  +4  −5        6      −1  −2  −3  +4  +5        9
−1  −2  +3  +4  −5        7      +1  −2  −3  +4  +5       10
+1  +2  −3  +4  −5        7      −1  +2  −3  +4  +5       11
+1  −2  +3  +4  −5        8      −1  −2  +3  +4  +5       12
−1  +2  +3  −4  −5        5      +1  +2  −3  +4  +5       12
+1  +2  +3  −4  −5        6      +1  −2  +3  +4  +5       13
−1  −2  −3  −4  +5        5      −1  +2  +3  +4  +5       14
+1  −2  −3  −4  +5        6      +1  +2  +3  +4  +5       15

Table 14.2 Null distribution of S+ when n = 5

s+       0      1      2      3      4      5      6      7
p(s+)   1/32   1/32   1/32   2/32   2/32   3/32   3/32   3/32

s+       8      9     10     11     12     13     14     15
p(s+)   3/32   3/32   3/32   2/32   2/32   1/32   1/32   1/32

Figure 14.2 Null distribution of S+ when n = 5 (probability, from 0.00 to 0.10, plotted against s+ from 0 to 15)


For n = 10 there are 2^10 = 1024 possible signed-rank sequences, so a listing would involve much
effort. Each sequence, though, would have probability 1/1024 when H0 is true, from which the
distribution of S+ when H0 is true can be obtained.
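
The listing that is tedious by hand is trivial by computer. The following Python sketch enumerates all 2^n equally likely sign assignments and tabulates S+; the function name `signed_rank_null_distribution` is our own, and brute-force enumeration is only sensible for small n (a few dozen at most).

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def signed_rank_null_distribution(n):
    """Exact null distribution of S+ by enumerating all 2**n sign patterns.

    Returns a dict mapping each possible value s+ to P(S+ = s+) when H0 is
    true, using the fact that every sign pattern has probability 1/2**n.
    """
    counts = Counter()
    for signs in product((False, True), repeat=n):
        # signs[k] is True when rank k + 1 carries a positive sign
        s_plus = sum(rank for rank, positive in enumerate(signs, start=1) if positive)
        counts[s_plus] += 1
    return {s: Fraction(c, 2**n) for s, c in sorted(counts.items())}


dist5 = signed_rank_null_distribution(5)
print(dist5[8])                                   # 3/32
print(len(signed_rank_null_distribution(10)))     # 56 possible values, 0 through 55
```

Using `Fraction` keeps the probabilities exact, which makes comparison with the tabulated values straightforward.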

We are now in a position to determine a rejection region for testing H0: μ̃ = 0 versus Ha: μ̃ > 0
that has a suitably small significance level α. For the case n = 5, consider the rejection region
R = {s+: s+ ≥ 13} = {13, 14, 15}. Then

α = P(reject H0 when H0 is true)
  = P(S+ = 13, 14, or 15 when H0 is true)
  = 1/32 + 1/32 + 1/32 = 3/32
  = .094


so that R = {13, 14, 15} specifies a test with approximate level .1. For the rejection region {14, 15},
α = 2/32 = .063. For the sample x1 = .58, x2 = 2.50, x3 = −.21, x4 = 1.23, x5 = .97, the signed-
rank sequence is −1, +2, +3, +4, +5, so s+ = 14 and at level .063 (or anything higher) H0 would be
rejected.
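Choosing a rejection region amounts to accumulating upper-tail probability until the target level would be exceeded. The sketch below does this with a counting recursion instead of an explicit listing; the helper names `null_counts` and `upper_critical_value` are hypothetical, and the no-ties assumption is retained.

```python
from fractions import Fraction


def null_counts(n):
    """counts[s] = number of the 2**n sign patterns giving S+ = s (no ties)."""
    counts = [1] + [0] * (n * (n + 1) // 2)
    for rank in range(1, n + 1):
        # Allow "rank" to be positive: update the highest sums first so that
        # each rank is used at most once.
        for s in range(len(counts) - 1, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts


def upper_critical_value(n, alpha):
    """Smallest c with P(S+ >= c) <= alpha, plus that exact level (or None)."""
    counts = null_counts(n)
    tail = 0
    best = None
    for c in range(len(counts) - 1, -1, -1):
        tail += counts[c]
        level = Fraction(tail, 2 ** n)
        if level > alpha:
            break
        best = (c, level)
    return best


print(upper_critical_value(5, Fraction(1, 10)))   # (13, Fraction(3, 32))
```

With target level .10 and n = 5 the search stops at c = 13, since adding s+ = 12 would raise the level to 5/32. The function returns `None` when even the single most extreme value has probability above the target.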


The Wilcoxon Signed-Rank Test 

Because the underlying distribution is assumed symmetric, μ = μ̃, so we will state the hypotheses of
interest in terms of μ rather than μ̃.¹ When the hypothesized value of μ is μ0, the absolute differences
|x1 − μ0|, ..., |xn − μ0| must be ranked from smallest to largest.


WILCOXON SIGNED-RANK TEST

Null hypothesis: H0: μ = μ0
Test statistic value: s+ = the sum of the ranks associated with positive (xi − μ0)’s


Alternative Hypothesis      Rejection Region for Level α Test
Ha: μ > μ0                  s+ ≥ c1
Ha: μ < μ0                  s+ ≤ n(n + 1)/2 − c1
Ha: μ ≠ μ0                  either s+ ≥ c or s+ ≤ n(n + 1)/2 − c


where the critical values c1 and c obtained from Appendix Table A.11
satisfy P(S+ ≥ c1) ≈ α and P(S+ ≥ c) ≈ α/2 when H0 is true.


Example 14.4 A producer of breakfast cereals wants to verify that a filler machine is operating 
correctly. The machine is supposed to fill one-pound boxes with 460 g, on average. This is a little 
above the 453.6 g needed for one pound. When the contents are weighed, it is found that 15 boxes 
yield the following measurements: 


454.4 470.8 447.5 453.2 462.6 445.0 455.9 458.2 
461.6 457.3 452.0 464.3 459.2 453.5 465.8 


Does the data provide convincing statistical evidence that the true mean weight differs from 460 g? 
Let’s use the seven-step hypothesis testing procedure outlined earlier in the book. 


1. Parameter: μ = the true average weight of all such cereal boxes

2. Hypotheses: H0: μ = 460 versus Ha: μ ≠ 460

3. It is believed that deviations of any magnitude from 460 g are just as likely to be positive as negative 
(in accord with the symmetry assumption), but the distribution may not be normal. Therefore, the 
Wilcoxon signed-rank test will be used to see if the filler machine is calibrated correctly. 


¹If the tails of the distribution are “too heavy,” as is the case with the Cauchy distribution, then μ will not exist. In such
cases, the Wilcoxon test will still be valid for tests concerning μ̃.
4. The test statistic will be s+ = the sum of the ranks associated with positive (xi − 460)’s

5. From Appendix Table A.11, P(S+ ≥ 95) = P(S+ ≤ 25) = .024 when H0 is true, so the two-tailed
test with approximate level .05 rejects H0 when either s+ ≥ 95 or s+ ≤ 25 [the exact α is
2(.024) = .048].

6. Subtracting 460 from each measurement gives


−5.6 10.8 −12.5 −6.8 2.6 −15.0 −4.1 −1.8
1.6 −2.7 −8.0 4.3 −.8 −6.5 5.8


The ranks are obtained by ordering these from smallest to largest without regard to sign.


Absolute magnitude    .8   1.6   1.8   2.6   2.7   4.1   4.3   5.6   5.8   6.5   6.8   8.0  10.8  12.5  15.0
Rank                   1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
Sign                   −     +     −     +     −     −     +     −     +     −     −     −     +     −     −


Thus s+ = 2 + 4 + 7 + 9 + 13 = 35.

7. Since s+ = 35 is not in the rejection region, it cannot be concluded at level .05 that μ differs
from 460. Even at level .094 (approximately .1), H0 cannot be rejected, since P(S+ ≤ 30) =
P(S+ ≥ 90) = .047 implies that s+ values between 30 and 90 are not significant at that level. The
P-value for this test thus exceeds .1.
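The arithmetic of steps 6 and 7 can be checked with a short Python sketch. The critical values 25 and 95 are simply the ones quoted from Appendix Table A.11 in step 5; the variable names are ours, and the ranking shortcut assumes no tied magnitudes (true for this sample).

```python
weights = [454.4, 470.8, 447.5, 453.2, 462.6, 445.0, 455.9, 458.2,
           461.6, 457.3, 452.0, 464.3, 459.2, 453.5, 465.8]
MU0 = 460
LOWER, UPPER = 25, 95   # two-tailed critical values quoted in step 5

# Rounding to one decimal removes floating-point noise from the subtraction.
diffs = [round(w - MU0, 1) for w in weights]

# Order by absolute magnitude; position in this order is the rank.
ranked = sorted(diffs, key=abs)
positive_ranks = [rank for rank, d in enumerate(ranked, start=1) if d > 0]
s_plus = sum(positive_ranks)

reject = s_plus <= LOWER or s_plus >= UPPER
print(positive_ranks, s_plus, reject)   # [2, 4, 7, 9, 13] 35 False
```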


Although a theoretical implication of the continuity of the underlying distribution is that ties will not 
occur, in practice they often do because of the discreteness of measuring instruments. If there are 
several data values with the same absolute magnitude, then they are typically assigned the average of 
the ranks they would receive if they differed very slightly from one another. For example, if in Example
14.4 x8 = 458.2 were instead 458.4, then two different values of (xi − 460) would have absolute
magnitude 1.6. The ranks to be averaged would be 2 and 3, so each would be assigned rank 2.5.
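A minimal Python sketch of this averaging rule follows, applied to the modified Example 14.4 data in which x8 is 458.4. The function name `signed_midranks` is hypothetical, and differences are rounded to one decimal so that equal magnitudes compare as equal.

```python
def signed_midranks(diffs):
    """Return (difference, midrank) pairs, averaging ranks over tied magnitudes."""
    order = sorted(diffs, key=abs)
    out = []
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and abs(order[j]) == abs(order[i]):
            j += 1
        midrank = (i + 1 + j) / 2          # average of ranks i+1, ..., j
        out.extend((d, midrank) for d in order[i:j])
        i = j
    return out


weights = [454.4, 470.8, 447.5, 453.2, 462.6, 445.0, 455.9, 458.4,  # x8 changed
           461.6, 457.3, 452.0, 464.3, 459.2, 453.5, 465.8]
diffs = [round(w - 460, 1) for w in weights]

pairs = signed_midranks(diffs)
s_plus = sum(rank for d, rank in pairs if d > 0)
print(pairs[:3], s_plus)
```

Here the two differences of magnitude 1.6 each receive rank 2.5, so the positive ranks become 2.5, 4, 7, 9, and 13. Note that with ties present the tabulated null distribution is no longer exact.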


Large-Sample Distribution of S+
Figure 14.2 displays the null distribution of S+ for n = 5, a symmetric distribution centered at 7.5. It is
straightforward to show (see Exercise 18) that when H0 is true,


$$E(S_+) = \frac{n(n+1)}{4} \qquad V(S_+) = \frac{n(n+1)(2n+1)}{24}$$


Moreover, when n is not small (say, n > 20), Lyapunov’s central limit theorem (Chapter 6, Exercise
68) implies that S+ has an approximately normal distribution. Appendix Table A.11 only presents
critical values for the Wilcoxon signed-rank test for n ≤ 20; beyond that, the test may be performed
using the “large-sample” test statistic


$$Z = \frac{S_+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$$


which has approximately a standard normal distribution when H0 is true.
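As a rough illustration of the standardization, the following Python sketch computes z and a two-sided normal-approximation P-value. It reuses s+ = 35 with n = 15 from Example 14.4 purely for the arithmetic (with n ≤ 20 the exact table would normally be preferred); the function name `signed_rank_z` is ours, and no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_z(s_plus, n):
    """Standardize s+ using the null mean n(n+1)/4 and variance n(n+1)(2n+1)/24."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (s_plus - mean) / sqrt(var)


z = signed_rank_z(35, 15)
p_two_sided = 2 * NormalDist().cdf(-abs(z))
print(round(z, 2), round(p_two_sided, 3))
```

For these numbers the null mean is 60 and the variance is 310, giving z of about −1.42, which agrees with the exact analysis in not rejecting H0 at level .05.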


The Wilcoxon Test for Paired Data 

When the data consisted of pairs (X1, Y1), ..., (Xn, Yn) and the differences D1 = X1 − Y1, ..., Dn =
Xn − Yn were normally distributed, in Chapter 10 we used a paired t test for hypotheses about the
expected difference μD. If normality is not assumed, hypotheses about μD can be tested by using the
Wilcoxon signed-rank test on the Di’s provided that the distribution of the differences is continuous
and symmetric. The null hypothesis is H0: μD = Δ0 for some null value Δ0 (most frequently 0), and
the test statistic s+ is the sum of the ranks associated with the positive (Di − Δ0)’s.


Example 14.5 Poor sleep including insomnia is common among veterans, particularly those with 
PTSD. The article “Cognitive Behavioral Therapy for Insomnia and Imagery Rehearsal in Combat 
Veterans with Comorbid Posttraumatic Stress: A Case Series” (Mil. Behav. Health 2016: 58-64) 
reports on a pilot study involving 11 combat veterans diagnosed with both insomnia and PTSD. Each 
participant attended eight weekly individual therapy sessions with lessons including sleep education, 
relaxation training, and nightmare re-scripting. Total nightly sleep time (min) was recorded for each 
veteran both before the 8-week intervention and after. In the accompanying table, differences
represent (sleep time after therapy) minus (sleep time before therapy).


Subject        1    2    3    4    5    6    7    8    9   10   11
Before       255  261  257  275  191  528  298  247  314  340  315
After        330  323  312  308  251  559  261  296  387  386  387
Difference    75   62   55   33   60   31  −37   49   73   46   72
Signed rank   11    8    6    2    7    1   −3    5   10    4    9


The relevant hypotheses are H0: μD = 0 versus Ha: μD > 0. Appendix Table A.11 shows that for a
test with significance level approximately .01, the null hypothesis should be rejected if s+ ≥ 59. The
test statistic value is s+ = 11 + 8 + ··· + 9 = 63 (the sum of every rank except 3), which falls in the
rejection region. We therefore reject H0 at significance level .01 in favor of the conclusion that the
therapy regimen increases sleep time, on average, for this population. Figure 14.3 shows R output for
this test, including the test statistic value (as V) and also the corresponding P-value, which is
P(S+ ≥ 63 when H0 is true).


        Wilcoxon signed rank test

data:  After and Before
V = 63, p-value = 0.002441
alternative hypothesis: true location shift is greater than 0


Figure 14.3 R output for Example 14.5
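The statistic and P-value reported in Figure 14.3 can be reproduced from the raw data with a short Python sketch. This is an independent check, not the algorithm R uses internally; it assumes no tied magnitudes (true here) and counts sign patterns with a simple recursion.

```python
from fractions import Fraction

before = [255, 261, 257, 275, 191, 528, 298, 247, 314, 340, 315]
after = [330, 323, 312, 308, 251, 559, 261, 296, 387, 386, 387]
diffs = [a - b for a, b in zip(after, before)]

# Rank the differences by absolute magnitude and add the positive ranks.
ranked = sorted(diffs, key=abs)
s_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)

# counts[s] = number of the 2**n equally likely sign patterns with S+ = s.
n = len(diffs)
counts = [1] + [0] * (n * (n + 1) // 2)
for rank in range(1, n + 1):
    for s in range(len(counts) - 1, rank - 1, -1):
        counts[s] += counts[s - rank]

p_value = Fraction(sum(counts[s_plus:]), 2 ** n)   # P(S+ >= s_plus) under H0
print(s_plus, p_value, float(p_value))             # 63 5/2048 0.00244140625
```

Only five of the 2048 sign patterns give a rank sum of 63 or more, which is where the P-value 5/2048 ≈ .002441 comes from.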


Efficiency of the Sign Test and Signed-Rank Test 

When the underlying distribution being sampled is normal, any one of three procedures (the t test,
the signed-rank test, or the sign test) can be used to test a hypothesis about μ (the point of
symmetry). The t test is the best test in this case because among all level α tests it is the one having the
greatest power (smallest type II error probabilities). On the other hand, neither the t test nor the
signed-rank test should be applied to data from a clearly skewed distribution, for two reasons. First,
lack of normality (resp., symmetry) violates the requirements for the validity of the t test (resp.,
signed-rank test). Second, both of these latter test procedures concern the mean μ, and for a heavily
skewed population the mean is arguably of less interest than the median μ̃.


Test procedure       Population assumption   Parameter of interest   Power (assuming normality)

Sign test            Continuous              μ̃                       Least powerful
Signed-rank test     Symmetric               μ = μ̃
One-sample t test    Normal                  μ = μ̃                   Most powerful
Let us now specifically compare Wilcoxon’s signed-rank test to the t test. Two questions will be
addressed:


1. When the underlying distribution is normal, the “home ground” of the t test, how much is lost by
using the signed-rank test?

2. When the underlying distribution is not normal, how much improvement can be achieved by using
the signed-rank test?


Unfortunately, there are no simple answers to the two questions. The difficulty is that power for the
Wilcoxon test is very difficult to determine for every possible distribution, and the same can be said
for the t test when the distribution is not normal. Even if power were easily obtained, any measure of
efficiency would clearly depend on which underlying distribution was assumed.

A number of different efficiency measures have been proposed by statisticians; one that many 
statisticians regard as credible is called asymptotic relative efficiency (ARE). The ARE of one test 
with respect to another is essentially the limiting ratio of sample sizes necessary to obtain identical 
error probabilities for the two tests. Thus if the ARE of one test with respect to a second equals .5, 
then when sample sizes are large, twice as large a sample size will be required of the first test to 
perform as well as the second test. Although the ARE does not characterize test performance for small 
sample sizes, the following results can be shown to hold: 


1. When the underlying distribution is normal, the ARE of the Wilcoxon test with respect to the t test
is approximately .95.

2. For any distribution, the ARE will be at least .86 and for many distributions will be much greater 
than 1. 


We can summarize these results by saying that, in large-sample situations, the Wilcoxon test is
never very much less efficient than the t test and may be much more efficient if the underlying
distribution is far from normal. Although the issue is far from resolved in the case of sample sizes
obtained in most practical problems, studies have shown that the Wilcoxon test performs reasonably
and is thus a viable alternative to the t test. In contrast, the sign test has an ARE of only 2/π ≈ .64
with respect to the t test when the underlying distribution is normal. (But, again, the sign test is
arguably the only appropriate test for heavily skewed populations.)


The Wilcoxon Signed-Rank Interval 
In Section 9.6, we discussed the “duality principle” that links hypothesis tests and confidence
intervals. Suppose we have a level α test procedure for testing H0: θ = θ0 versus Ha: θ ≠ θ0 based on
sample data x1, ..., xn. If we let A denote the set of all θ0 values for which H0 is not rejected, then A is
a 100(1 − α)% CI for θ.²

The two-tailed Wilcoxon signed-rank test rejects H0 if s+ is either ≥ c or ≤ n(n + 1)/2 − c,
where c is obtained from Appendix Table A.11 once the desired significance level α is specified. For
fixed x1, ..., xn, the 100(1 − α)% signed-rank interval will consist of all μ0 for which H0: μ = μ0 is
not rejected at level α. To identify this interval, it is convenient to express the test statistic S+ in
another form.


²There are pathological examples in which the set A is not an interval of θ values, but instead the complement of an
interval or something even stranger. To be more precise, we should really replace the notion of a CI with that of a
confidence set. In the cases of interest here, the set A does turn out to be an interval.
PROPOSITION

s+ = the number of pairwise averages (xi + xj)/2 with i ≤ j that are ≥ μ0
(These pairwise averages are known as Walsh averages.)


That is, if we average each xj in the list with each xi to its left, including (xj + xj)/2 = xj, and count the
number of these averages that are ≥ μ0, s+ results. In moving from left to right in the list of sample
values, we are simply averaging every pair of observations in the sample (again, including each
(xj + xj)/2) exactly once, so the order in which the observations are listed before averaging is not
important. The equivalence of the two methods for computing s+ is not difficult to verify. The number
of pairwise averages is $\binom{n}{2} + n = n(n+1)/2$. If either too many or too few of these pairwise
averages are ≥ μ0, H0 is rejected.
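The equivalence can be checked numerically. The sketch below computes s+ both ways for the five-observation sample used earlier in this section (with μ0 = 0); the function names are ours, and the comparison assumes no ties and no Walsh average exactly equal to μ0.

```python
from itertools import combinations_with_replacement


def walsh_averages(xs):
    """All n(n+1)/2 pairwise averages (x_i + x_j)/2 with i <= j, sorted."""
    return sorted((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))


def s_plus_from_ranks(xs, mu0):
    """Sum of the ranks of |x - mu0| that belong to positive differences."""
    diffs = sorted((x - mu0 for x in xs), key=abs)
    return sum(rank for rank, d in enumerate(diffs, start=1) if d > 0)


def s_plus_from_walsh(xs, mu0):
    """Number of Walsh averages that are at least mu0."""
    return sum(1 for avg in walsh_averages(xs) if avg >= mu0)


sample = [0.58, 2.50, -0.21, 1.23, 0.97]
print(s_plus_from_ranks(sample, 0), s_plus_from_walsh(sample, 0))   # 14 14
```

Of the 15 Walsh averages for this sample, only the average of −.21 with itself is negative, which is why both computations give 14.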


Example 14.6 The following observations are values of cerebral metabolic rate for rhesus monkeys:
x1 = 4.51, x2 = 4.59, x3 = 4.90, x4 = 4.93, x5 = 6.80, x6 = 5.08, x7 = 5.67. The 28 pairwise averages
are, in increasing order,


4.51 4.55 4.59 4.705 4.72 4.745 4.76 4.795 4.835 4.90 
4.915 4.93 4.99 5.005 5.08 5.09 5.13 5.285 5.30 5.375 
5.655 5.67 5.695 5.85 5.865 5.94 6.235 6.80 


The first few and the last few of these are pictured on a measurement axis in Figure 14.4. 


[Figure 14.4: the smallest and largest Walsh averages plotted on a measurement axis; at level .0469, H0 is
not rejected for μ0 between the 3rd-smallest and 3rd-largest averages]


Figure 14.4 Plot of the data for Example 14.6


Because of the discreteness of the distribution of S+, α = .05 cannot be obtained exactly. The
rejection region {0, 1, 2, 26, 27, 28} has α = 6/128 ≈ .047, which is as close as possible to .05, so the
level is approximately .05. Thus if the number of pairwise averages ≥ μ0 is between 3 and 25,
inclusive, H0 is not rejected. As displayed in Figure 14.4, the approximate 95% CI for μ is (4.59, 5.94);
the endpoints are the 3rd-lowest and 3rd-highest (3rd and 26th ordered) Walsh averages.


In general, once the pairwise averages are ordered from smallest to largest, the endpoints of the
Wilcoxon interval are two of the “extreme” averages. To express this precisely, let the smallest
pairwise average be denoted by x̄(1), the next smallest by x̄(2), ..., and the largest by x̄(n(n+1)/2).

PROPOSITION

If the level α Wilcoxon signed-rank test for H0: μ = μ0 versus Ha: μ ≠ μ0 is
to reject H0 if either s+ ≥ c or s+ ≤ n(n + 1)/2 − c, then a 100(1 − α)%
CI for μ is


$$\left(\bar{x}_{(n(n+1)/2-c+1)},\ \bar{x}_{(c)}\right) \qquad (14.2)$$


In words, the interval extends from the kth smallest pairwise average to the kth largest average, where
k = n(n + 1)/2 − c + 1. Appendix Table A.12 gives the values of c that correspond to the usual
confidence levels for n = 5, 6, ..., 25. In R, the wilcox.test function applied to a vector x containing the
sample data will return this signed-rank interval if the user includes the option conf.int = T.


Example 14.7 (Example 14.6 continued) For n = 7, an 89.1% interval (approximately 90%) is
obtained by using c = 24, since the rejection region {0, 1, 2, 3, 4, 24, 25, 26, 27, 28} has α = .109.
The interval is (x̄(28−24+1), x̄(24)) = (x̄(5), x̄(24)) = (4.72, 5.85), which extends from the fifth smallest
to the fifth largest pairwise average.
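A short Python sketch of interval (14.2) for the rhesus monkey data is given below. The function names are hypothetical; `Fraction` is used so the averages are exact, and the exact two-sided level is obtained by enumerating all 2^7 sign patterns (no ties assumed).

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product


def signed_rank_interval(xs, c):
    """Interval (14.2): from the k-th smallest to the c-th smallest Walsh
    average, where k = n(n+1)/2 - c + 1."""
    walsh = sorted((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))
    k = len(walsh) - c + 1
    return walsh[k - 1], walsh[c - 1]


def two_sided_alpha(n, c):
    """Exact P(S+ >= c) + P(S+ <= n(n+1)/2 - c) when H0 is true."""
    top = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        s = sum(rank for rank, pos in enumerate(signs, start=1) if pos)
        hits += (s >= c) or (s <= top - c)
    return Fraction(hits, 2 ** n)


rates = [Fraction(v) for v in ("4.51", "4.59", "4.90", "4.93", "6.80", "5.08", "5.67")]
for c in (26, 24):
    lo, hi = signed_rank_interval(rates, c)
    print(c, float(lo), float(hi), float(two_sided_alpha(7, c)))
```

With c = 26 this gives the interval (4.59, 5.94) at α = 6/128, and with c = 24 it gives (4.72, 5.85) at α = 14/128, matching Examples 14.6 and 14.7.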


The derivation of the signed-rank interval depended on having a single sample from a continuous
symmetric distribution with mean (median) μ. When the data is paired, the interval constructed from
the Walsh averages of the differences d1, d2, ..., dn is a CI for the mean (median) difference μD.

For n > 20, the large-sample approximation to the Wilcoxon test based on standardizing S+ gives
an approximation to c in (14.2). The result for a 100(1 − α)% interval is


$$c \approx \frac{n(n+1)}{4} + z_{\alpha/2}\sqrt{\frac{n(n+1)(2n+1)}{24}}$$


The efficiency of the Wilcoxon interval relative to the t interval is roughly the same as that for the
Wilcoxon test relative to the t test. In particular, for large samples when the underlying population is
normal, the Wilcoxon interval will tend to be slightly longer than the t interval, but if the population is
quite nonnormal (e.g., symmetric but with heavy tails), then the Wilcoxon interval will tend to be
much shorter than the t interval.
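The large-sample approximation to c is simple to evaluate. The sketch below is only an illustration: the function name `approx_c` and the choice n = 25 are ours, and the text does not say how the result should be rounded to an integer index, so the raw value is returned.

```python
from math import sqrt
from statistics import NormalDist


def approx_c(n, conf_level=0.95):
    """Normal approximation to the critical value c in interval (14.2)."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}
    return n * (n + 1) / 4 + z * sqrt(n * (n + 1) * (2 * n + 1) / 24)


print(round(approx_c(25), 1))
```

For n = 25 and 95% confidence the approximation is about 235.3 out of 325 Walsh averages; rounding up to the next integer makes the interval slightly wider, which errs on the conservative side.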


Exercises: Section 14.2 (11-24) 


11. Reconsider the situation described in Exercise 34(a) of Section 9.2, and use the Wilcoxon test
with α = .05 to test the specified hypotheses.

12. Use the Wilcoxon test to analyze the data given in Example 9.12.

13. The following pH measurements at a proposed water intake site appear in the 2011 report
“Sacramento River Water Quality Assessment for the Davis-Woodland Water Supply Project”:

7.20 7.24 7.31 7.38 7.45 7.60 7.86

Use the Wilcoxon signed-rank test to determine whether the true mean pH level at this site
exceeds 7.3 with significance level .05.

14. A random sample of 15 automobile mechanics certified to work on a certain type of car was
selected, and the time (in minutes) necessary for each one to diagnose a particular problem was
determined, resulting in the following data:

[first eight values illegible in the source]
31.9 53.2 12.5 23.2 8.8 24.9 30.2

Use the Wilcoxon test at significance level .10 to decide whether the data suggests that true
average diagnostic time is less than 30 min.

15. Both a gravimetric and a spectrophotometric method are under consideration for determining
phosphate content of a particular material. Twelve samples of the material are obtained, each is
split in half, and a determination is made on each half using one of the two methods, resulting in
the following data:

Sample    1     2     3     4     5     6     7     8     9    10    11    12
Grav.   54.7  58.5  66.8  46.1  52.3  74.3  92.5  40.2  87.3  74.8  63.2  68.5
Spec.   55.0  55.7  62.9  45.5  51.1  75.4  89.6  38.4  86.8  72.5  62.3  66.0

Use the Wilcoxon test to decide whether one technique gives on average a different value than
the other technique for this type of material.

16. Fifty-three participants performed a series of tests in which a small “zap” was delivered to one
compass point, selected at random, on a joystick in their hand. In one setting, subjects were told
to move the joystick in the same direction of the zap; in another setting, they were told to move
the joystick in the direction opposite to the zap. A series of trials was performed under each
setting, and the number of correct moves under both settings was recorded. (“An Experimental
Setup to Test Dual-Joystick Directional Responses to Vibrotactile Stimuli,” IEEE Trans. on
Haptics 2018.)

a. The authors performed a Wilcoxon signed-rank test on the paired differences (number correct
in same direction minus number correct in opposite direction, which is discrete but won’t
greatly impact the analysis). The resulting test statistic value was s+ = 695. Test
H0: μD = 0 versus Ha: μD ≠ 0 at the .10 significance level using the large-sample version of
the test.

b. The same article also explored whether participants would make more correct moves if the
“zaps” were instead delivered by a glove they wore while grasping the joystick. Again for
n = 53 subjects, the difference (number correct with joystick minus number correct with
glove) was computed, and the signed-rank test statistic value was s+ = 136. Perform a
two-sided large-sample test, and interpret your findings.

17. The article “Punishers Benefit from Third-Party Punishment in Fish” (Science, 8 Jan 2010: 171)
describes an experiment meant to simulate behavior of cleaner fish, so named because they eat
parasites off of “client” fish but will sometimes take a bite of the client’s mucus instead.
(Cleaner fish prefer the mucus over the parasites.) Eight female cleaner fish were provided bits
of prawn (preferred food) and fish-flake (less preferred), then a male cleaner fish chased them
away. One minute later, the process was repeated.

a. The following data on the amount of prawn eaten by each female in the two rounds is
consistent with information in the article. Use Wilcoxon’s signed-rank test to determine
whether female fish eat less of their preferred food, on average, after having been chased by a
male.

Female        1     2     3     4     5     6     7     8
1st trial   .207  .215  .103  .182  .282  .228  .152  .293
2nd trial   .164  .033  .092  .003  .115  .250  .056  .247

b. The researchers recorded the same information on the male cleaner fish (the chasers),
resulting in a signed-rank test statistic value of s+ = 28. Does this provide evidence that
males increase their average preferred food consumption the second time around?

18. The signed-rank statistic can be represented as S+ = 1·U1 + 2·U2 + ··· + n·Un, where
Ui = 1 if the sign of the (xi − μ0) with the ith smallest absolute magnitude is positive (in which
case i is included in S+) and Ui = 0 if this value is negative (i = 1, 2, 3, ..., n). Furthermore,
when H0 is true, the Ui’s are independent Bernoulli rvs with p = .5.

a. Use this representation to obtain the mean and variance of S+ when H0 is true. [Hint: The
sum of the first n positive integers is n(n + 1)/2, and the sum of the squares of the first n
positive integers is n(n + 1)(2n + 1)/6.]

b. A particular type of steel beam has been designed to have a compressive strength (lb/in²) of at
least 50,000. An experimenter obtained a random sample of 25 beams and determined the
strength of each one, resulting in the following data (expressed as deviations from 50,000):

−10 −27 36 −55 73 −77 −81
[seven values illegible in the source]
150 −155 −159 165 178 −183 −192
199 −212 −217 −229

Carry out a test using a significance level of approximately .01 to see if there is strong
evidence that the design condition has been violated.

19. Reconsider the calorie-burning data in Exercise 10 from Section 14.1.

a. Utilize the Wilcoxon signed-rank procedure to test H0: μD = 0 versus Ha: μD < 0 for the
population of REE differences (IF minus standard diet). What assumption is required here
that was not necessary for the sign test?

b. Now apply the paired t test to the hypotheses in part (a). What extra assumptions are
required?

c. Compare the results of the three tests (sign, signed-rank, and paired t) and discuss what you
find.

20. Suppose that observations X1, X2, ..., Xn are made on a process at times 1, 2, ..., n. On the basis
of this data, we wish to test

H0: the Xi’s constitute an independent and identically distributed sequence

versus

Ha: Xi+1 tends to be larger than Xi for i = 1, ..., n − 1 (an increasing trend)

Suppose the Xi’s are ranked from 1 to n. Then when Ha is true, larger ranks tend to occur later in
the sequence, whereas if H0 is true, large and small ranks tend to be mixed together. Let Ri be
the rank of Xi and consider the test statistic $D = \sum_{i=1}^{n} (R_i - i)^2$. Then small values of D give
support to Ha (e.g., the smallest value is 0 for R1 = 1, R2 = 2, ..., Rn = n), so H0 should be
rejected in favor of Ha if d ≤ c. When H0 is true, any sequence of ranks has probability 1/n!.
Use this to find c for which the test has a level as close to .10 as possible in the case n = 4.
[Hint: List the 4! rank sequences, compute d for each one, and then obtain the null distribution
of D. See the Lehmann book in the bibliography for more information.]

21. Obtain the 99% signed-rank interval for true average pH using the data in Exercise 13.

22. Obtain a 95% signed-rank interval for true average diagnostic time using the data in Exercise 14.
[Hint: Try to compute only those pairwise averages having relatively small or large values,
rather than all 120 averages.]

23. Obtain a CI for μD of Exercise 17 using the data given there; your confidence level should be
roughly 95%.

24. The following observations are copper contents (%) for a sample of Bidri artifacts (a type of
ancient Indian metal handicraft) at the Victoria and Albert Museum in London (“Enigmas of
Bidri,” Surface Engr. 2005: 333-339): 2.4, 2.7, 5.3, and 10.1. What confidence levels are
achievable for this sample size using the signed-rank interval? Select an appropriate confidence
level and compute the interval.

14.3. Two-Sample Rank-Based Inference 


When at least one of the sample sizes in a two-sample problem is small, the t test requires the
assumption of normality (at least approximately). There are situations, though, in which an
investigator would want to use a test that is valid even if the underlying distributions are quite nonnormal.
We now describe such a test, called the Wilcoxon rank-sum test. An alternative name for the
procedure is the Mann-Whitney test, although the Mann-Whitney test statistic is sometimes
expressed slightly differently from that of the Wilcoxon test. The Wilcoxon test procedure is
“distribution-free” because it will have the desired level of significance for a very large collection of
underlying distributions rather than just the normal distribution.


ASSUMPTIONS

X1, ..., Xm and Y1, ..., Yn are two independent random samples from continuous distributions
with means μ1 and μ2, respectively. The X and Y distributions have the same shape and spread,
the only possible difference between the two being in the values of μ1 and μ2.


When H0: μ1 − μ2 = Δ0 is true, the X distribution is shifted by the amount Δ0 to the right of the
Y distribution; i.e., fX(x) = fY(x − Δ0). When H0 is false, the shift is by an amount other than Δ0;
note, though, that we still assume the two distributions only differ by a shift in means. This
assumption can be difficult to verify in practice, but the Wilcoxon rank-sum test is nonetheless a
popular approach to comparisons based on small samples.


A Rank-Based Test Statistic 

Let’s first test H0: μ1 − μ2 = 0; then the X’s and Y’s are identically distributed when H0 is true.
Consider the case m = 3, n = 4. Denote the observations by x1, x2, and x3 (the first sample) and y1,
y2, y3, and y4 (the second sample). If μ1 is actually much larger than μ2, then most of the observed x’s
will be larger than the observed y’s. However, if H0 is true, then the values from the two samples
should be intermingled. The test statistic will quantify how much intermingling there is in the two
samples.

To begin, pool the x’s and y’s into a single combined sample of size m + n = 7 and rank these
observations from smallest to largest, with the smallest receiving rank 1 and the largest, rank 7. If
most of the large ranks (or most of the small ranks) were associated with x observations, we would
begin to doubt H0. This suggests the test statistic


W = the sum of the ranks in the combined sample associated with X observations (14.3) 


For the values of m and n under consideration, the smallest possible value of W is 1 + 2 + 3 = 6 (if
all three x’s are smaller than all four y’s), and the largest possible value is 5 + 6 + 7 = 18 (if all three
x’s are larger than all four y’s).

As an example, suppose x1 = −3.10, x2 = 1.67, x3 = 2.01, y1 = 5.27, y2 = 1.89, y3 = 3.86, and
y4 = .19. Then the pooled ordered sample is −3.10, .19, 1.67, 1.89, 2.01, 3.86, and 5.27. The X ranks
for this sample are 1 (for −3.10), 3 (for 1.67), and 5 (for 2.01), giving w = 1 + 3 + 5 = 9.

The test procedure based on the statistic (14.3) requires knowledge of the null distribution of
W. When H0 is true, all seven observations come from the same population. This means that under
H0, any possible triple of ranks associated with the three x’s, such as (1, 4, 5), (3, 5, 6), or (5, 6, 7),
has the same probability as any other possible rank triple. Since there are $\binom{7}{3} = 35$ possible rank
triples, under H0 each rank triple has probability 1/35. From a list of all 35 rank triples and the
W value associated with each, the null distribution of W can immediately be determined. For example,
there are four rank triples for which W = 11, namely (1, 3, 7), (1, 4, 6), (2, 3, 6), and (2, 4, 5), so
P(W = 11) = 4/35. The complete sampling distribution appears in Table 14.3 and Figure 14.5.


Table 14.3 Probability distribution of W when H0 is true (m = 3, n = 4)

w       6     7     8     9    10    11    12    13    14    15    16    17    18
p(w)  1/35  1/35  2/35  3/35  4/35  4/35  5/35  4/35  4/35  3/35  2/35  1/35  1/35


[Figure 14.5: bar chart of probability (vertical axis, 0.00 to 0.15) against w (horizontal axis, 6 to 18)]

Figure 14.5 Sampling distribution of W when H0 is true (m = 3, n = 4)
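The listing of rank triples that underlies Table 14.3 can be automated. The Python sketch below tallies W over every equally likely choice of X ranks; the function name `rank_sum_null_counts` is ours, and ties are assumed absent.

```python
from collections import Counter
from itertools import combinations


def rank_sum_null_counts(m, n):
    """Tally W over all C(m+n, m) equally likely sets of X ranks (no ties)."""
    return Counter(sum(ranks) for ranks in combinations(range(1, m + n + 1), m))


counts = rank_sum_null_counts(3, 4)
total = sum(counts.values())            # 35 rank triples
for w in sorted(counts):
    print(f"w = {w:2d}   p = {counts[w]}/{total}")
```

The same function handles other sample sizes, although the number of rank sets grows quickly with m and n.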


Suppose we wished to test H0: μ1 − μ2 = 0 against Ha: μ1 − μ2 < 0 based on the example data
given previously, for which the observed value of W was 9. Then the P-value associated with the test
is the chance of observing a W-value of 9 or lower, assuming H0 is true. Using Table 14.3,


P-value = P(W ≤ 9 when H0 is true) = P(W = 6, 7, 8, 9) = 7/35 = .2


We would thus not reject H0 at any reasonable significance level.
Constructing the null sampling distribution of W manually can be tedious, since there are generally
$\binom{m+n}{m}$ possible arrangements of ranks to consider. Software will provide w and the associated
P-value quickly, though various packages perform the P-value calculation slightly differently.
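For the small example above, both w and the lower-tailed P-value can be obtained with a few lines of Python. This is a brute-force sketch (function names are ours) that assumes no ties; statistical packages typically use faster recursions or approximations.

```python
from fractions import Fraction
from itertools import combinations


def rank_sum(xs, ys):
    """W = sum of the pooled-sample ranks of the x's (no ties assumed)."""
    pooled = sorted(xs + ys)
    return sum(pooled.index(x) + 1 for x in xs)


def lower_tail_p(w, m, n):
    """Exact P(W <= w) when H0 is true, by listing every possible set of X ranks."""
    sums = [sum(ranks) for ranks in combinations(range(1, m + n + 1), m)]
    return Fraction(sum(1 for s in sums if s <= w), len(sums))


xs = [-3.10, 1.67, 2.01]
ys = [5.27, 1.89, 3.86, 0.19]
w = rank_sum(xs, ys)
print(w, lower_tail_p(w, len(xs), len(ys)))   # 9 1/5
```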


The Wilcoxon Rank-Sum Test 

The null hypothesis H0: μ1 − μ2 = Δ0 is handled by subtracting Δ0 from each Xi and using the
(Xi − Δ0)’s as the Xi’s were previously used. The smallest possible value of the statistic W is
1 + 2 + ··· + m = m(m + 1)/2, which occurs when the (Xi − Δ0)’s are all to the left of the Y sample.
The largest possible value of W occurs when the (Xi − Δ0)’s lie entirely to the right
of the Y’s; in this case, W = (n + 1) + ··· + (m + n) = (sum of first m + n integers) −
(sum of first n integers), which gives m(m + 2n + 1)/2. As with the special case m = 3, n = 4, the
null distribution of W is symmetric about the value that is halfway between the smallest and largest
values; this middle value is m(m + n + 1)/2. Because of this symmetry, probabilities involving lower-
tail critical values can be obtained from corresponding upper-tail values.
WILCOXON RANK-SUM TEST

Null hypothesis: H0: μ1 − μ2 = Δ0

Test statistic value: $w = \sum_{i=1}^{m} r_i$

where ri = rank of (xi − Δ0) in the combined sample
of m + n (x − Δ0)’s and y’s


Alternative Hypothesis      Rejection Region for Level α Test
Ha: μ1 − μ2 > Δ0            w ≥ c1
Ha: μ1 − μ2 < Δ0            w ≤ m(m + n + 1) − c1
Ha: μ1 − μ2 ≠ Δ0            either w ≥ c or w ≤ m(m + n + 1) − c


where P(W ≥ c1 when H0 is true) ≈ α, P(W ≥ c when H0 is true) ≈ α/2


Because W has a discrete probability distribution, there will not usually exist a critical value
corresponding exactly to one of the usual significance levels. Appendix Table A.13 gives upper-tail critical
values for probabilities closest to .05, .025, .01, and .005, from which level .05 or .01 one- and two-
tailed tests can be obtained. The table gives information only for 3 ≤ m ≤ n ≤ 8. To use the table,
the X and Y samples should be labeled so that m ≤ n. Ties are handled as suggested for the
signed-rank test in the previous section.


Example 14.8 The urinary fluoride concentration (parts per million) was measured both for a sample 
of livestock grazing in an area previously exposed to fluoride pollution and for a similar sample 
grazing in an unpolluted region: 


Polluted     21.3 [11]   18.7 [7]   23.0 [12]   17.1 [3]   16.8 [2]   20.9 [10]   19.7 [8]
Unpolluted   14.2 [1]    18.3 [5]   17.2 [4]    18.4 [6]   20.0 [9]


The values in brackets indicate the rank of each observation in the combined sample of 12 values.
Does the data indicate strongly that the true average fluoride concentration for livestock grazing in the
polluted region is larger than for the unpolluted region? Let’s use the Wilcoxon rank-sum test at level
α = .01.


1. The sample sizes here are 7 and 5. To obtain m <_n, label the unpolluted observations as the x’s 


(x; = 14.2, ..., x5 = 20.0) and the polluted observations as the y’s. Thus the parameters are 


Ht, = the true average fluoride concentration without pollution 
Hz = the true average concentration with pollution 


2. The hypotheses are 


Ho: My — ty = 9 
Hi: fy — Ly < 0 (pollution is associated with an increase in concentration) 


3. In order to perform the Wilcoxon rank-sum test, we will assume that the fluoride concentration 
distributions for these two livestock populations have the same shape and spread, but possibly 
differ in mean. 

4. The test statistic value is $w = \sum_{i=1}^{5} r_i$, where $r_i$ = the rank of $x_i$ among all 12 observations.

5. From Appendix Table A.13 with m = 5 and n = 7, $P(W \ge 47 \text{ when } H_0 \text{ is true}) \approx .01$. The
critical value for the lower-tailed test is therefore m(m + n + 1) − 47 = 5(13) − 47 = 18; $H_0$ will
now be rejected if $w \le 18$.

6. The computed W is $w = r_1 + r_2 + \cdots + r_5 = 1 + 5 + 4 + 6 + 9 = 25$.

7. Since 25 is not $\le 18$, $H_0$ is not rejected at (approximately) level .01. The data does not provide
convincing statistical evidence at the .01 significance level that average fluoride concentration is
higher among livestock grazing in the polluted region. ■
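
The arithmetic in steps 4 through 7 can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original example; the helper name `rank_sum` is an illustrative choice, the code assumes there are no tied observations (true for this data), and the upper-tail critical value 47 is simply copied from step 5 rather than computed.

```python
# Data from Example 14.8 (urinary fluoride concentration, ppm).
unpolluted = [14.2, 18.3, 17.2, 18.4, 20.0]               # x sample, m = 5
polluted = [21.3, 18.7, 23.0, 17.1, 16.8, 20.9, 19.7]     # y sample, n = 7


def rank_sum(x, y):
    """Return w, the sum of the ranks of the x's in the combined sample.

    Assumes no tied values, which holds for this data set.
    """
    combined = sorted(x + y)
    rank = {value: position for position, value in enumerate(combined, start=1)}
    return sum(rank[value] for value in x)


m, n = len(unpolluted), len(polluted)
w = rank_sum(unpolluted, polluted)

upper_critical = 47                                  # Table A.13 value quoted in step 5
lower_critical = m * (m + n + 1) - upper_critical    # 5(13) - 47 = 18

# Lower-tailed test: reject H0 only if w <= lower_critical.
print(w, lower_critical, w <= lower_critical)        # 25 18 False
```

With tied observations, the dictionary lookup would have to be replaced by an average-rank rule such as the one described for the signed-rank test.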


Alternative Versions of the Rank-Sum Test 

Appendix Table A.13 allows us to perform the Wilcoxon rank-sum test provided that m and n are
both $\le 8$. For larger sample sizes, a central limit theorem for nonindependent variables can be used to
show that W has an approximately normal distribution. (The genesis of a bell-shaped curve can even
be seen in Figure 14.5 where m = 3 and n = 4.) When $H_0$ is true, the mean and variance of W (see
Exercise 32) are

$$E(W) = \frac{m(m+n+1)}{2} \qquad V(W) = \frac{mn(m+n+1)}{12}$$

These suggest that when m > 8 and n > 8 the rank-sum test may be performed using the test
statistic

$$Z = \frac{W - m(m+n+1)/2}{\sqrt{mn(m+n+1)/12}}$$

which has approximately a standard normal distribution when $H_0$ is true.
Some statistical software packages, including R, use an alternative formulation called the
Mann-Whitney U test. Consider all possible pairs $(X_i, Y_j)$, of which there are mn. Define a test
statistic U by

$$U = \text{the number of } (X_i, Y_j) \text{ pairs for which } X_i - Y_j > \Delta_0 \qquad (14.4)$$

It can be shown (Exercise 34) that the test statistics U and W are related by U = W − m(m + 1)/2.
The mean and variance of U can thus be obtained from the corresponding expressions for W, and the
normal approximation to W applies equally to U.
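
As a concrete check of the relation between U and W, here is a short Python sketch that reuses the fluoride data of Example 14.8 with a null value of 0. The function and variable names are illustrative assumptions, ties are assumed absent, and the standardized value is shown only to illustrate the formula, since these sample sizes are below the m > 8, n > 8 guideline for the normal approximation.

```python
import math

x = [14.2, 18.3, 17.2, 18.4, 20.0]               # unpolluted sample, m = 5
y = [21.3, 18.7, 23.0, 17.1, 16.8, 20.9, 19.7]   # polluted sample, n = 7
delta0 = 0.0
m, n = len(x), len(y)

# Mann-Whitney U: count the (x_i, y_j) pairs with x_i - y_j > delta0.
u = sum(1 for xi in x for yj in y if xi - yj > delta0)

# Wilcoxon W: sum of the ranks of the x's in the combined sample (no ties assumed).
combined = sorted(x + y)
w = sum(combined.index(xi) + 1 for xi in x)

# The identity U = W - m(m + 1)/2.
assert u == w - m * (m + 1) // 2


def standardized_w(w, m, n):
    """Standardize W using its null mean and variance."""
    mean_w = m * (m + n + 1) / 2
    var_w = m * n * (m + n + 1) / 12
    return (w - mean_w) / math.sqrt(var_w)


z = standardized_w(w, m, n)
print(u, w, round(z, 3))   # 10 25 -1.218
```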

Finally, when using the normal approximation, some slightly tedious algebra can be used to
rearrange the standardized version of W so that it looks similar to the two-sample z test statistic:

$$Z = \frac{W - E(W)}{\sqrt{V(W)}} = \frac{\bar{R}_1 - \bar{R}_2}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$

where $\bar{R}_1$ and $\bar{R}_2$ denote the average ranks for the two samples, and $\sigma^2 = (m+n)(m+n+1)/12$.


Efficiency of the Wilcoxon Rank-Sum Test 
When the distributions being sampled are both normal with $\sigma_1 = \sigma_2$ and therefore have the same
shapes and spreads, either the pooled t test or the Wilcoxon test can be used. (The two-sample t test
assumes normality but not equal standard deviations, so assumptions underlying its use are more
restrictive in one sense and less in another than those for Wilcoxon’s test.) In this situation, the pooled
t test is best among all possible tests in the sense of maximizing power for any fixed $\alpha$. However, an
investigator can never be absolutely certain that underlying assumptions are satisfied. It is therefore
relevant to ask (1) how much is lost by using Wilcoxon’s test rather than the pooled t test when the
distributions are normal with equal variances and (2) how W compares to T in nonnormal situations.
The notion of asymptotic relative efficiency was discussed in the previous section in connection
with the one-sample t test and Wilcoxon signed-rank test. The results for the two-sample tests are the
same as those for the one-sample tests. When normality and equal variances both hold, the rank-sum
test is approximately 95% as efficient as the pooled t test in large samples. That is, the t test will give
the same error probabilities as the Wilcoxon test using slightly smaller sample sizes. On the other
hand, the Wilcoxon test will always be at least 86% as efficient as the pooled t test and may be much
more efficient if the underlying distributions are very nonnormal, especially with heavy tails. The
comparison of the Wilcoxon test with the two-sample (unpooled) t test is less clear-cut. The t test is
not known to be the best test in any sense, so it seems safe to conclude that as long as the population
distributions have similar shapes and spreads, the behavior of the Wilcoxon test should compare quite
favorably to the two-sample t test.

Lastly, we note that power calculations for the Wilcoxon test are quite difficult. This is because the 
distribution of W when $H_0$ is false depends not only on $\mu_1 - \mu_2$ but also on the shape of the two
distributions. For most underlying distributions, the nonnull distribution of W is virtually intractable. 
This is why statisticians developed asymptotic relative efficiency as a means of comparing tests. With 
the capabilities of modern-day computer software, another approach to power calculations is to carry 
out a simulation experiment. 
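
A simulation of this kind might look like the following Python sketch. Everything specific in it is an assumption made for illustration: normal populations with unit standard deviation, m = n = 10, an upper-tailed test of equal means based on the normal approximation to W with critical value 1.645, 2000 replications, and the function names themselves.

```python
import math
import random


def rank_sum_z(x, y):
    """Standardized rank-sum statistic (normal approximation, no tie correction)."""
    m, n = len(x), len(y)
    labeled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    w = sum(rank for rank, (_, label) in enumerate(labeled, start=1) if label == 0)
    mean_w = m * (m + n + 1) / 2
    sd_w = math.sqrt(m * n * (m + n + 1) / 12)
    return (w - mean_w) / sd_w


def simulated_power(shift, m=10, n=10, z_crit=1.645, reps=2000, seed=1):
    """Estimate P(reject H0) when the x population is shifted upward by `shift`."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(shift, 1.0) for _ in range(m)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if rank_sum_z(x, y) >= z_crit:      # upper-tailed rejection region
            rejections += 1
    return rejections / reps


size_estimate = simulated_power(0.0)    # should be near the nominal level .05
power_estimate = simulated_power(1.5)   # power against a shift of 1.5 standard deviations
print(size_estimate, power_estimate)
```

Replacing `rng.gauss` with draws from a heavier-tailed distribution would give a rough picture of how the power changes with the shape of the underlying distributions.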


The Wilcoxon Rank-Sum Interval 

Similar to the signed-rank interval of Section 14.2, a CI for $\mu_1 - \mu_2$ based on the Wilcoxon rank-sum test
is obtained by determining, for fixed $x_i$’s and $y_j$’s, the set of all $\Delta_0$ values for which $H_0$: $\mu_1 - \mu_2 = \Delta_0$ is
not rejected. This is easiest to do if we use the Mann-Whitney U statistic (14.4), according to which $H_0$
should be rejected if the number of $(x_i - y_j)$’s $> \Delta_0$ is either too small or too large.

This, in turn, suggests that we compute $x_i - y_j$ for each i and j and order these mn differences from
smallest to largest. Then if the null value $\Delta_0$ is neither smaller than most of the differences nor larger
than most, $H_0$: $\mu_1 - \mu_2 = \Delta_0$ is not rejected. Varying $\Delta_0$ now shows that a CI for $\mu_1 - \mu_2$ will have as
its lower endpoint one of the ordered $(x_i - y_j)$’s, and similarly for the upper endpoint.


PROPOSITION Let $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ be the observed values in two independent
samples from continuous distributions that differ only in location (and not in
shape or spread). With $d_{ij} = x_i - y_j$ and the ordered differences denoted by
$d_{ij(1)}, d_{ij(2)}, \ldots, d_{ij(mn)}$, the general form of a $100(1 - \alpha)\%$ CI for $\mu_1 - \mu_2$ is

$$\left(d_{ij(mn-c+1)},\ d_{ij(c)}\right) \qquad (14.5)$$

where c is the critical value for the two-tailed level $\alpha$ Wilcoxon rank-sum test.


Notice that the form of the Wilcoxon rank-sum interval (14.5) is very similar to the Wilcoxon
signed-rank interval (14.2); (14.2) uses pairwise averages from a single sample, whereas (14.5) uses pairwise
differences from two samples. Appendix Table A.14 gives values of c for selected values of m and
n. In R, the wilcox.test function applied to vectors x and y containing the sample data will
return this rank-sum interval if the user includes the option conf.int = TRUE.


Example 14.9 The article “Some Mechanical Properties of Impregnated Bark Board” (Forest 
Products J.) reports the following data on maximum crushing strength (psi) for a sample of epoxy- 
impregnated bark board and for a sample of bark board impregnated with another polymer: 


Epoxy (x’s) 10,860 11,120 11,340 12,130 14,380 13,070 
Other (y’s) 4590 4850 6510 5640 6390 


Let’s obtain a 95% CI for the true average difference in crushing strength between the
epoxy-impregnated board and the other type of board.
From Appendix Table A.14, since the smaller sample size is 5 and the larger sample size is
6, c = 26 for a confidence level of approximately 95%, and mn − c + 1 = (5)(6) − 26 + 1 = 5.
All 30 $d_{ij}$’s appear in Table 14.4. The five smallest $d_{ij}$’s are $d_{ij(1)} = 4350$, 4470, 4610, 4730, and
$d_{ij(5)} = 4830$; and the five largest $d_{ij}$’s are (in descending order) 9790, 9530, 8740, 8480, and 8220.
Thus the CI is $(d_{ij(5)}, d_{ij(26)}) = (4830, 8220)$.


Table 14.4 Differences ($d_{ij}$) for the rank-sum interval in Example 14.9

                               $y_j$
$x_i$        4590    4850    5640    6390    6510
10,860       6270    6010    5220    4470    4350
11,120       6530    6270    5480    4730    4610
11,340       6750    6490    5700    4950    4830
12,130       7540    7280    6490    5740    5620
13,070       8480    8220    7430    6680    6560
14,380       9790    9530    8740    7990    7870

■
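
The interval of Example 14.9 can be reproduced by sorting all mn differences and picking out two order statistics. The following Python sketch is illustrative only; the function name `rank_sum_interval` is an assumption, and the critical value c = 26 is taken from the example rather than computed.

```python
# Crushing-strength data (psi) from Example 14.9.
epoxy = [10860, 11120, 11340, 12130, 14380, 13070]   # x's
other = [4590, 4850, 6510, 5640, 6390]               # y's


def rank_sum_interval(x, y, c):
    """Return (d_(mn-c+1), d_(c)) from the ordered differences x_i - y_j."""
    diffs = sorted(xi - yj for xi in x for yj in y)
    mn = len(diffs)
    lower = diffs[mn - c]    # the (mn - c + 1)th smallest difference (1-indexed)
    upper = diffs[c - 1]     # the cth smallest difference
    return lower, upper


ci = rank_sum_interval(epoxy, other, c=26)
print(ci)   # (4830, 8220)
```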
When m and n are both large, the aforementioned normal approximation can be used to derive a
large-sample approximation for the value c in interval (14.5). The result is

$$c \approx \frac{mn}{2} + z_{\alpha/2}\sqrt{\frac{mn(m+n+1)}{12}}$$

As with the signed-rank interval, the rank-sum interval (14.5) is quite efficient with respect to the
t interval; in large samples, (14.5) will tend to be only a bit longer than the t interval when the
underlying populations are normal and may be considerably shorter than the t interval if the
underlying populations have heavier tails than do normal populations. And once again, the actual
confidence level for the t interval may be quite different from the nominal level in the presence of
substantial nonnormality.


Exercises: Section 14.3 (25-36) 


25. In an experiment to compare the bond strength of two different adhesives, each adhesive was used
in five bondings of two surfaces, and the force necessary to separate the surfaces was determined
for each bonding. For adhesive 1, the resulting values were 229, 286, 245, 299, and 250, whereas
the adhesive 2 observations were 213, 179, 163, 247, and 225. Let $\mu_i$ denote the true average
bond strength of adhesive type i. Use the Wilcoxon rank-sum test at level .05 to test
$H_0$: $\mu_1 = \mu_2$ versus $H_a$: $\mu_1 > \mu_2$.

26. The accompanying data shows the alcohol content (percent) for random samples of 7 German
beers and 8 domestic beers. Does the data suggest that German beers have a different average
alcohol content than those brewed in the USA? Use the Wilcoxon rank-sum test at $\alpha = .05$.

German     5.00   4.90   3.80   4.82   4.80   5.44   6.60
Domestic   4.85   5.04   4.20   4.10   4.50   4.70   4.30   5.50

27. A modification has been made to one assembly line for a particular automobile chassis. Because
the modification involves extra cost, it will be implemented throughout all lines only if sample data
strongly indicates that the modification has decreased true average assembly time by more than 1 h.
Assuming that the assembly time distributions differ only with respect to location if at all, use the
Wilcoxon rank-sum test at level .05 on the accompanying data (also in hours) to test the appropriate
hypotheses.

Original process   8.6   5.1   4.5   5.4   6.3   6.6   5.7   8.5
Modified process   5.5   4.0   3.8   6.0   5.8   4.9   7.0   5.7

28. Can video games improve balance among the elderly? The article “The Effect of Virtual Reality
Gaming on Dynamic Balance in Older Adults” (Age and Ageing 2012: 549-552) reported an
experiment in which 34 senior citizens were randomly assigned to one of two groups: (1) 16 who
engaged in a six-week exercise regimen using the Wii Fit Balance Board (WBB) and (2) 18 who were
told not to vary their daily physical activity during that interval. The accompanying data on
improvement in 8-foot-up-and-go time (sec), a standard test of agility and balance, is consistent with
information in the article. Test whether the true average improvement is greater using the WBB than
under control conditions at the .05 significance level.

WBB       −1.9   −0.8   0.1   0.5   0.6   0.7   0.8   0.9
          1.1    1.2    1.5   2.0   2.1   2.7   3.2   3.7
Control   −2.6   −2.2   −2.1  −1.8  −1.4  −1.1  −0.7  −0.3  −0.1
          0.0    0.3    0.4   0.6   1.0   1.3   2.3   2.4   4.5

29. Reconsider the situation described in Exercise 110 of Chapter 10 and the following Minitab
output (the Greek letter eta is used to denote a median).

Mann-Whitney Confidence Interval and Test

good   N = 8   Median = 0.540
poor   N = 8   Median = 2.400

Point estimate for ETA1 - ETA2 is -1.155
95.9% CI for ETA1 - ETA2 is (-3.160, -0.409)
W = 41.0
Test of ETA1 = ETA2 versus ETA1 < ETA2 is significant at 0.0027

a. Verify that Minitab’s test statistic value is correct.

b. Carry out an appropriate test of hypotheses using a significance level of .01.


30. The article “Opioid Use and Storage Patterns by Patients after Hospital Discharge following
Surgery” (PLoS ONE 2016) reported a study of 30 women who just had Caesarian sections. The
women were classified into two groups: 14 with a high need for pain medicine post-surgery and 16
with a low need. The total oral morphine equivalent prescribed at discharge was determined for each
woman, and the resulting Wilcoxon rank-sum test statistic was W = 249.5. (The .5 comes from ties in
the data, but that won’t affect the P-value much.) Test the hypotheses $H_0$: $\mu_1 = \mu_2$ versus
$H_a$: $\mu_1 \ne \mu_2$ at the .05 significance level. Does it appear that physicians prescribe opioids in
proportion to patients’ pain-control needs?

31. The article “Mutational Landscape Determines Sensitivity to PD-1 Blockade in Non-Small Cell
Lung Cancer” (Science, 3 April 2015) described a study of 16 cancer patients taking the drug
Keytruda. For each patient, the number of nonsynonymous mutations per tumor was determined;
higher numbers indicate better drug effectiveness. The data is separated into patients that showed a
durable clinical benefit (partial or stable response lasting > 6 months) and those with no durable
benefit. Use the methods of this section to determine whether patients experiencing durable clinical
benefit tend to have a higher average number of nonsynonymous mutations than those with no
durable benefit (use $\alpha = .05$).

Durable benefit      170   228   300   302   315   490   774
No durable benefit   11    28    46    115   148   161   180   300   625

32. The Wilcoxon rank-sum statistic can be represented as $W = R_1 + R_2 + \cdots + R_m$, where $R_i$ is
the rank of $X_i - \Delta_0$ among all m + n such differences. When $H_0$ is true, each $R_i$ is equally likely to
be one of the first m + n positive integers; that is, $R_i$ has a discrete uniform distribution on the
values 1, 2, 3, ..., m + n.

a. Determine the mean value of each $R_i$ when $H_0$ is true and then show that the mean value of W is
m(m + n + 1)/2. [Hint: The sum of the first k positive integers is k(k + 1)/2.]

b. The variance of each $R_i$ is easily determined. However, the $R_i$’s are not independent random
variables because, for example, if m = n = 10 and we are told that $R_1 = 5$, then $R_2$ must be one of
the other 19 integers between 1 and 20. However, if a and b are any two distinct positive integers
between 1 and m + n inclusive, it follows that $P(R_i = a \text{ and } R_j = b) = 1/[(m+n)(m+n-1)]$
since two integers are being sampled without replacement from among 1, 2, ..., m + n. Use this fact
to show that $\mathrm{Cov}(R_i, R_j) = -(m+n+1)/12$, and then show that the variance of W is
mn(m + n + 1)/12.

33. The article “Controlled Clinical Trial of Canine Therapy Versus Usual Care to Reduce Patient
Anxiety in the Emergency Department” (PLoS ONE 2019) reported on an experiment in which 80
adult hospital patients were randomly assigned to either 15 min with a certified therapy dog
(m = 40) or usual care (n = 40). Each patient’s change in self-reported pain, depression, and anxiety
(pre-treatment minus post-treatment) was recorded. The researchers employed a rank-sum test to
compare the two treatment groups on each of these three outcomes; the resulting test statistic values
appear below.

Change in:   Pain   Depression   Anxiety
w =          1475   1316         1171

Test the hypotheses $H_0$: $\mu_1 - \mu_2 = 0$ against $H_a$: $\mu_1 - \mu_2 < 0$ (1 = dog therapy treatment,
2 = control) for each of the three response variables at the .01 significance level. What can be said
about the chance of committing at least one type I error among the three tests?

34. Refer to Exercise 32. Sort the ranks of the $X_i$’s, so that $R_1 < R_2 < \cdots < R_m$.

a. In terms of the $R_i$’s, how many of the $Y_j$’s are less than the smallest $X_i$? Less than the
second-smallest $X_i$?

b. When $\Delta_0 = 0$, the Mann-Whitney test statistic is U = the number of $(X_i, Y_j)$ pairs for which
$X_i > Y_j$. Use part (a) to express U as a sum, then show this sum is equal to W − m(m + 1)/2.

c. Use the mean and variance of W to determine E(U) and V(U).

35. Obtain the 90% rank-sum CI for $\mu_1 - \mu_2$ using the data in Exercise 25.

36. Obtain a 95% CI for $\mu_1 - \mu_2$ using the data in Exercise 27. Is your interval consistent with the
result of the hypothesis test in that exercise?


14.4 Nonparametric ANOVA 


The analysis of variance (ANOVA) procedures in Chapter 11 for comparing I population or treatment
means assumed that every population/treatment distribution is normal with the same standard
deviation, so that the only potential difference is their means $\mu_1, \ldots, \mu_I$. Here we present methods for
testing equality of the $\mu_i$’s that apply to a broader class of population distributions.


The Kruskal-Wallis Test 

The Kruskal-Wallis test extends the Wilcoxon rank-sum test of the previous section to the case of
three or more populations or treatments. The I population/treatment distributions under consideration
are assumed to be continuous, have the same shape and spread, but possibly different means. More
formally, with $f_i$ denoting the pdf of the ith distribution, we assume that
$f_1(x - \mu_1) = f_2(x - \mu_2) = \cdots = f_I(x - \mu_I)$, so that the distributions differ by (at most) a shift.
Following the notation of Chapter 11, let $J_i$ = the ith sample size, $n = \sum J_i$ = the total number of
observations in the data set, and $X_{ij}$ = the jth observation in the ith sample
($j = 1, \ldots, J_i$; $i = 1, \ldots, I$). As in the rank-sum test, we replace each
observation $X_{ij}$ with its rank, $R_{ij}$, among all n observations. So, the smallest observation across all
samples receives rank 1, the next-smallest rank 2, and so on through n.


Example 14.10 Diabetes and its associated health issues among children are of ever-increasing 
concern worldwide. The article “Clinical and Metabolic Characteristics among Mexican Children 
with Different Types of Diabetes Mellitus” (PLoS ONE, Dec. 16, 2016) reported on an in-depth study 
of children with one of four types of diabetes: (1) type 1 autoimmune, (2) type 2, (3) type 1 idiopathic 
(the most common type in children), and (4) what the researchers called “type 1.5.” (The last category 
describes children who exhibit characteristics consistent with both type 1 and type 2; the authors note
that the American Diabetes Association does not recognize such a category.) 

To illustrate the Kruskal-Wallis method, presented here is a subset of the triglyceride measure- 
ments (mmol/L) on these children, along with their associated ranks in brackets. 


Diabetes group Triglyceride level (mmol/L) 
1 1.06 [5] 1.94 [11] 1.07 [6] 
2 1.30 [9] 2.08 [12] 2.15 [13] 
3 1.08 [7] 0.45 [1] 0.85 [2] 1.13 [8] 2.28 [14] 0.87 [3] 
4 1.39 [10] 0.89 [4] 


In this example, I = 4 (four populations), $J_1 = J_2 = 3$, $J_3 = 6$, $J_4 = 2$, and n = 14. The first
observation is $x_{11} = 1.06$ with associated rank $r_{11} = 5$. ■


When $H_0$: $\mu_1 = \cdots = \mu_I$ is true, the I population/treatment distributions are identical, and so the
$X_{ij}$’s form a random sample from a single population distribution. It follows that each $R_{ij}$ is uniformly
distributed on the integers 1, 2, ..., n, so that $E(R_{ij}) = (n + 1)/2$ for every i and j when $H_0$ is true. If
we let $\bar{R}_{i\cdot}$ denote the mean of the ranks in the ith sample, then

$$E(\bar{R}_{i\cdot}) = \frac{1}{J_i}\sum_{j=1}^{J_i} E(R_{ij}) = \frac{n+1}{2}$$

Moreover, regardless of whether $H_0$ is true, the “grand mean” of all n ranks is (n + 1)/2, the average
of the first n positive integers.

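Similar to the treatment sum of squares SSTr from one-way ANOVA, the Kruskal-Wallis test
statistic, denoted by H, quantifies “between-groups” variability by measuring how much the $\bar{R}_{i\cdot}$’s
differ from the grand mean:

$$H = \frac{12}{n(n+1)}\sum_{i=1}^{I} J_i\left(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot}\right)^2 = \frac{12}{n(n+1)}\sum_{i=1}^{I} J_i\left(\bar{R}_{i\cdot} - \frac{n+1}{2}\right)^2 \qquad (14.6)$$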


When the null hypothesis is true, each $\bar{R}_{i\cdot}$ will be close to its expected value (the grand mean),
whereas when $H_0$ is false certain samples should have an overabundance of high ranks and others too
many low ranks, resulting in a larger value of H. Even for small $J_i$’s, the exact null distribution of H in
Expression (14.6) is unwieldy. Thankfully, when the sample sizes are not too small, the approximate
sampling distribution of H under the null hypothesis is known.

KRUSKAL-WALLIS TEST
Null hypothesis: $H_0$: $\mu_1 = \cdots = \mu_I$
Alternative hypothesis: not all $\mu_i$’s are equal

Test statistic value: $h = \dfrac{12}{n(n+1)}\sum_{i=1}^{I} J_i\left(\bar{r}_{i\cdot} - \dfrac{n+1}{2}\right)^2$
where $\bar{r}_{i\cdot}$ denotes the average rank within the ith sample.

When $H_0$ is true, H has approximately a chi-squared distribution
with I − 1 df. This approximation is reasonable provided that
all $J_i \ge 5$.


It should not be surprising that H has an approximate $\chi^2_{I-1}$ distribution. By the Central Limit
Theorem, the averages $\bar{R}_{i\cdot}$ should be approximately normal, and so H is similar to the sum of squares of
I standardized normal rvs. However, as in the distribution of the sample variance $S^2$, these rvs are not
independent (the sum of all ranks is fixed), and that one constraint costs one degree of freedom.


Example 14.11 (Example 14.10 continued) Though the sample sizes in our illustrative example are a
bit too small to meet the requirements of the Kruskal-Wallis test, we proceed with the rest of the test
procedure on this reduced data set. The rank averages are $\bar{r}_{1\cdot} = (5 + 11 + 6)/3 = 7.33$, $\bar{r}_{2\cdot} = 11.33$,
$\bar{r}_{3\cdot} = 5.83$, and $\bar{r}_{4\cdot} = 7$. The grand mean of all 14 ranks is (n + 1)/2 = 7.5, and the test statistic value is

$$h = \frac{12}{14(15)}\sum_{i=1}^{4} J_i\left(\bar{r}_{i\cdot} - 7.5\right)^2 = \frac{12}{210}\left[3(7.33 - 7.5)^2 + \cdots + 2(7 - 7.5)^2\right] = 3.50$$

Comparing this to the critical value $\chi^2_{.05,4-1} = 7.815$, we would not reject $H_0$ at the .05 significance
level. Equivalently, the P-value is $P(H \ge 3.50 \text{ when } H \sim \chi^2_3) = .321$, again indicating no reason to
reject $H_0$.

Details of the Kruskal-Wallis test for the full sample appear below. The small test statistic of 5.78
and relatively large P-value of .123 indicate that the mean triglyceride levels for these four
populations of diabetic children are not statistically significantly different.

Diabetes group   $J_i$     $\bar{r}_{i\cdot}$
1                25        64.0
2                31        80.8
3                63        62.0
4                17        76.6
                 n = 136   $\bar{r}_{\cdot\cdot}$ = 68.5

h = 5.78, P-value = .123
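
The reduced-data calculation of Example 14.11 can be replayed in a few lines of Python. This is a hedged sketch rather than the authors' procedure: the function names are illustrative, ties are assumed absent (true for these 14 values), and the P-value uses a closed-form upper-tail expression that is valid only for a chi-squared distribution with 3 df.

```python
import math

# Triglyceride subset (mmol/L) from Example 14.10, keyed by diabetes group.
groups = {
    1: [1.06, 1.94, 1.07],
    2: [1.30, 2.08, 2.15],
    3: [1.08, 0.45, 0.85, 1.13, 2.28, 0.87],
    4: [1.39, 0.89],
}


def kruskal_wallis(samples):
    """Compute h = 12/(n(n+1)) * sum of J_i (rbar_i - (n+1)/2)^2, assuming no ties."""
    pooled = sorted(v for sample in samples for v in sample)
    rank = {v: r for r, v in enumerate(pooled, start=1)}
    n = len(pooled)
    total = 0.0
    for sample in samples:
        mean_rank = sum(rank[v] for v in sample) / len(sample)
        total += len(sample) * (mean_rank - (n + 1) / 2) ** 2
    return 12 / (n * (n + 1)) * total


def chi2_upper_tail_3df(x):
    """P(chi-squared with 3 df >= x), using the closed form available for 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


h = kruskal_wallis(list(groups.values()))
p_value = chi2_upper_tail_3df(h)
print(round(h, 2), round(p_value, 2))   # 3.5 0.32
```

The unrounded statistic is about 3.505, so the P-value printed here is slightly smaller than the .321 obtained from the rounded value 3.50.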


Expression (14.6) is sometimes written in other forms. For example, with $W_i$ denoting the sum of
the ranks for the ith sample (analogous to the Wilcoxon statistic W), it can be shown that

$$H = \frac{12}{n(n+1)}\sum_{i=1}^{I}\frac{1}{J_i}\left(W_i - \frac{J_i(n+1)}{2}\right)^2 = \frac{12}{n(n+1)}\sum_{i=1}^{I}\frac{W_i^2}{J_i} - 3(n+1) \qquad (14.7)$$

The quantity $J_i(n + 1)/2$ in the middle expression of (14.7) is the expected rank sum for the ith
sample when $H_0$ is true. The far-right expression in (14.7) is computationally quicker than (14.6).
Alternatively, some software packages report the Kruskal-Wallis statistic in the form $H = \sum Z_i^2$,
where

$$Z_i = \frac{\bar{R}_{i\cdot} - (n+1)/2}{\sigma/\sqrt{J_i}}$$

and $\sigma^2 = n(n+1)/12$.

In Chapter 11, we emphasized the need for two measures of variability, SSTr and SSE, with the 
latter measuring the variability within each sample. Why is SSE not required here? The fundamental 
ANOVA identity SST = SSTr + SSE still applies to the $R_{ij}$’s, but because the ranks are just a
rearrangement of the integers 1 through n, the total sum of squares SST depends only on n and not on
the raw data (Exercise 41). Thus, once we know n and SSTr, the other two sums of squares are
completely determined. 


Nonparametric ANOVA for a Randomized Block Design 

The Kruskal-Wallis test is applicable to data resulting from a completely randomized design
(independent random samples from I population or treatment distributions). Suppose instead that we
have data from a randomized block experiment and wish to test the null hypothesis that the true
population/treatment means are equal (i.e., “no treatment effect”). To test $H_0$: $\mu_1 = \cdots = \mu_I$ in this
situation, the observations within each block are ranked from 1 to I, and then the average rank $\bar{r}_{i\cdot}$ is
computed for each of the I treatments.


Example 14.12 The article “Modeling Cycle Times in Production Planning Models for Wafer 
Fabrication” (IEEE Trans. on Semiconductor Manuf. 2016: 153-167) reports on a study to compare 
three different linear programming models used in the simulation of factory processes: allocated 
cleaning function (ACF), fractional lead time (FLT), and simple rounding down (SRD). Simulations 
were run under five different demand representations, and the profit from the (simulated) manufacture 
of a particular product was determined. 

In this study, there are I = 3 treatments being compared using J = 5 blocks. The profit data
presented in the article, along with the rank of each observation within its block and the rank average 
for each treatment, appears below. 


Demand representation (block) 


LP model 1 2 3 4 5 Rank average
ACF $44,379 [1] $69,465 [3] $18,317 [2] $69,981 [3] $32,354 [3] 2.4 
FLT $47,825 [3] $43,354 [1] $17,512 [1] $48,707 [2] $30,993 [2] 1.8 
SRD $47,446 [2] $53,393 [2] $27,554 [3] $45,435 [1] $25,662 [1] 1.8 


Within each block, the average of the ranks 1, ..., I is simply (I + 1)/2, and hence this is also the
grand mean. If the null hypothesis of “no treatment effect” is true, then all I! possible arrangements of
the ranks within each block are equally likely, so each rank $R_{ij}$ is uniformly distributed on {1, ..., I}
and has expected value (I + 1)/2. The nonparametric method for analyzing this type of data, known as
the Friedman test (developed by the Nobel Prize-winning economist Milton Friedman), relies on the
following statistic:

$$F_r = \frac{12}{I(I+1)}\sum_{i=1}^{I}\sum_{j=1}^{J}\left(\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot}\right)^2 = \frac{12J}{I(I+1)}\sum_{i=1}^{I}\left(\bar{R}_{i\cdot} - \frac{I+1}{2}\right)^2 \qquad (14.8)$$


As with the Kruskal-Wallis test, the Friedman test rejects $H_0$ when the computed value of the test
statistic is too large (an upper-tailed test). For even small-to-moderate values of J (the number of
blocks), the test statistic $F_r$ in (14.8) has approximately a chi-squared distribution with I − 1 df.


Example 14.13 (Example 14.12 continued) Applying Expression (14.8) to the profit data, the
observed value of the test statistic is

$$f_r = \frac{12(5)}{3(3+1)}\left[(2.4 - 2)^2 + (1.8 - 2)^2 + (1.8 - 2)^2\right] = 1.2$$

Since this is far less than the critical value $\chi^2_{.10,3-1} = 4.605$, the null hypothesis of equal mean profit
for all three linear programming models is certainly not rejected. ■
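
The within-block ranking and the statistic of Example 14.13 can be sketched in Python as follows. The function name `friedman_statistic` and the dictionary layout are illustrative assumptions; the code assumes no ties within a block (true for the profit data), and the P-value line uses the fact that the upper-tail probability of a chi-squared distribution with 2 df is exp(−x/2), so it applies only when I = 3.

```python
import math

# Profit data from Example 14.12: one list per LP model, one entry per block.
profits = {
    "ACF": [44379, 69465, 18317, 69981, 32354],
    "FLT": [47825, 43354, 17512, 48707, 30993],
    "SRD": [47446, 53393, 27554, 45435, 25662],
}


def friedman_statistic(table):
    """Compute 12J/(I(I+1)) * sum of (rbar_i - (I+1)/2)^2 from within-block ranks."""
    treatments = list(table)
    I = len(treatments)                    # number of treatments
    J = len(table[treatments[0]])          # number of blocks
    rank_sums = dict.fromkeys(treatments, 0)
    for j in range(J):
        # Rank the I treatments within block j (1 = smallest profit).
        ordered = sorted(treatments, key=lambda t: table[t][j])
        for rank, t in enumerate(ordered, start=1):
            rank_sums[t] += rank
    grand_mean = (I + 1) / 2
    return 12 * J / (I * (I + 1)) * sum(
        (rank_sums[t] / J - grand_mean) ** 2 for t in treatments
    )


fr = friedman_statistic(profits)
p_value = math.exp(-fr / 2)    # chi-squared upper tail with I - 1 = 2 df only
print(round(fr, 3), round(p_value, 3))   # 1.2 0.549
```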


The Friedman test is used frequently to analyze “expert ranking” data (see, for example, Exercise
46). If each of J individuals ranks I items on some criterion, then the data is naturally of the type for
which the Friedman test was devised. Each ranker acts as a block, and the test seeks out significant
differences in the mean rank received by each of the I items.

As with the Kruskal-Wallis test, the total sum of squares for the Friedman test is fixed (in fact, it is
a simple function of I and J; see Exercise 47(b)). In randomized block ANOVA in Chapter 11, the
block sum of squares SSB gave an indication of whether blocking accounted for a significant amount
of the total variation in the response values. In the data of Example 14.12, the blocking variable of
demand representation clearly has an impact: for instance, the profits in the j = 3 block (middle
column) are far lower than in other blocks. Unfortunately, SSB for Friedman’s test is identically 0
(Exercise 47(a)), and so the effectiveness of blocking remains unquantified.


Exercises: Section 14.4 (37-48) 


37. The article cited in Example 14.10 also reported the fasting C-peptide levels (FCP, nmol/L) for
the children in the study.

Diabetes group   $J_i$   $\bar{r}_{i\cdot}$
1                26      72.8
2                ...     ...
3                ...     ...
4                7       104.6

(Sample sizes are different here than in Example 14.10 due to missing data in the latter.) Use a
Kruskal-Wallis test (as the article’s authors did) to determine whether true average FCP levels differ
across these four populations of diabetic children.

38. The article “Analyses of Phenotypic Differentiations among South Georgian Diving Petrel
Populations Reveal an Undescribed and Highly Endangered Species from New Zealand” (PLoS ONE,
June 27, 2018) reports the possible discovery of a new species of P. georgicus, distinct from the birds
of the same name found in the South Atlantic and South Indian Oceans. The table below summarizes
information on the bill length (mm) of birds sampled for the study; bill length distributions are
skewed, so a nonparametric method is appropriate.

Bird origin    $J_i$   $\bar{r}_{i\cdot}$
S. Atlantic    22      121.0
S. Indian      38      109.3
New Zealand    126     83.9

Test at the .01 significance level to see whether the mean bill length differs across these three
geographic groups of P. georgicus. [The researchers performed over a dozen similar tests and found
many features in which the New Zealand petrels are starkly different from the others.]

39. The following data on the fracture load (kN) of Plexiglas at three different loading point locations
appeared in the article “Evaluating Fracture Behavior of Brittle Polymeric Materials Using an IASCB
Specimen” (J. of Engr. Manuf. 2013: 133-140).

Distance   Fracture load
31.2 mm    4.78   4.41   4.91   5.06
36.0 mm    3.47   3.85   4.77   3.63
42.0 mm    2.62   2.99   3.39   2.86

Use a rank-based method to determine whether loading point distance affects true mean fracture load,
at the $\alpha = .01$ level.

40. Dental composites used to fill cavities will decay over time. The article “In vitro Aging Behavior
of Dental Composites Considering the Influence of Filler Content, Storage Media and Incubation
Time” (PLoS ONE, April 9, 2018) reported on a study to measure the hardness (MPa) of a particular
type of resin after 14 days stored in artificial saliva, lactic acid, citric acid, or 40% ethanol.

Saliva    542.18   508.31   473.44   514.33   488.41   477.46   501.71   513.65   471.46   421.90
Lactic    478.99   501.15   488.97   463.68   471.14   568.14   494.15   494.99   483.89   520.33
Citric    427.97   388.59   378.01   341.61   395.12   433.59   353.03   344.90   387.09   501.81
Ethanol   482.96   451.48   436.69   424.42   465.64   387.59   322.55   277.84   367.36   385.75

Use a Kruskal-Wallis test with significance level .05 to determine whether true average hardness
differs by liquid medium.

41. Let SST denote the total sum of squares for the Kruskal-Wallis test:
$SST = \sum_i \sum_j (R_{ij} - \bar{R}_{\cdot\cdot})^2$. Verify that $SST = n(n^2 - 1)/12$. [Hint: The $R_{ij}$’s are a
rearrangement of the integers 1 through n. Use the formulas for the sum and sum of squares of the
first n positive integers.]

42. Show that the two formulas for the Kruskal-Wallis test statistic in Expression (14.7) are identical
and both equal the original formula for H.

43. Many people suffer back or neck pain due to bulging discs in the lumbar or cervical spine, but the
thoracic spine (the section in between) is less well-studied. The article “Kinematic analysis of the
space available for cord and disc bulging of the thoracic spine using kinematic magnetic resonance
imaging (kMRI)” (The Spine J. 2018: 1122-1127) describes a study using kMRI to measure disc
bulge (mm) in neutral, flexion, and extension positions.

a. Suppose measurements were taken on just 6 subjects. The following bulge measurements at the
T11-T12 disc (bottom of the thoracic spine) are consistent with information in the article:

            Subject
Position    1      2      3      4      5      6
Neutral     1.28   0.88   0.69   1.52   0.83   2.58
Flexion     1.29   0.76   0.43   2.11   1.07   2.18
Extension   1.51   1.12   0.23   1.54   0.20   1.67

Convert these measurements into within-block ranks, and use the Friedman test to determine if the
true average disc bulge at T11-T12 varies by position.

b. The study actually involved 105 subjects, each serving as her/his own block. The sum of the ranks
for the three positions were neutral = 219, flexion = 222, extension = 189. Use these to perform the
Friedman test, and report your conclusion at the .05 significance level.


c. Similar measurements were also taken
on all 105 subjects at the T4-T5 disc
(top of the thoracic spine); rank sums
consistent with the article are 207, 221,
and 202. Repeat the test of part (b) for
the T4-T5 disc.


44. Is that Yelp review real or fake? The article
“A Framework for Fake Review Detection
in Online Consumer Electronics Retailers”
(Information Processing and Management
2019: 1234-1244) tested five different
classification algorithms on a large corpus
of Yelp reviews from New York, Los
Angeles, Miami, and San Francisco whose
authenticity (real or fake) was known.
Since review styles differ greatly by city,
the researchers used city as a blocking
variable. The table below shows the $F_1$
score, a standard measure of classification
accuracy, for each algorithm-city pairing.
($F_1$ scores range from 0 to 1, with higher
values implying better accuracy.)


Algorithm              NYC   LA    Miami  SF
Logistic regression    ao    aa    .78    ai
Decision tree          .81   .74   .81    .81
Random forest          .82   .78   .80    .80
Gaussian Naive Bayes   .72   .69   .71    .69
AdaBoost               .83   .79   .82    .82


Test the null hypothesis that the five algorithms
are equally accurate in classifying
real and fake Yelp reviews at the .10
significance level.


45. Image segmentation is a key tool in computer
vision (i.e., helping computers “see”
the meaning in pictures). The article “Effi- 
cient Quantum Inspired Meta-Heuristics for 
Multi-Level True Colour Image Thresh- 
olding” (Applied Soft Computing 2017: 
472-513) reported a study to compare 10 
image segmentation algorithms—six con- 
ventional, four inspired by quantum com- 
puting. Each algorithm was applied to 10 
different images, from an elephant to Mono 
Lake to the Mona Lisa; the images serve as 
blocks in this study. Kapur’s method, an 
entropy measure for image segmentation 
tools, was applied to each (algorithm, 
image) pair; lower numbers are better. The 
article reports the following rank averages 
for the 10 algorithms. 
Algorithm     Rank average
GA            8.30
SA            9.10
PSO           8.90
DE            6.60
BSA           5.90
CoDE          6.20
QIGAMLTCI     2.15
QISAMLTCI     3.65
QIPSOMLTCI    1.85
QIDEMLTCI     2.35


Does the data indicate that the 10 algorithms 
are not equally effective at minimizing 
Kapur’s entropy measure? Test at the .01 
significance level. What do the rank aver- 
ages suggest about quantum-inspired versus 
conventional image segmentation methods? 


46. Sustainability in corporate culture is typically
described as having three dimensions:
economic, environmental, and social. The 
article “Development of Indicators for the 
Social Dimension of Sustainability in a 
U.S. Business Context” (J. of Cleaner 
Production 2019: 687-697) reported on the 
development of a survey instrument for the 
least-studied of these “three pillars,” social 
sustainability. The researchers had 26 
experts take the final version of the survey. 
In each survey section, participants were 
asked to rank a set of possible metrics from 
most important to least important. 


a. The four metrics listed below were cate- 
gorized as “public actualization needs.” 
Use the mean ranks to test at the .05 sig- 
nificance level whether experts systemat- 
ically prioritize some of these metrics over 
others with respect to social sustainability. 


Metric                                                  Mean rank
Ratio of public contributions (e.g., donations)
  to market capitalization                              1.80
% of public that says company is making the
  world a better place                                  2.50
% of employees that contribute service for
  the public good                                       2.85
Ratio of minority management to minority
  workforce                                             2.85
b. Repeat part (a) using the following survey
metrics for “public safety and security
needs.”

Metric                                                  Mean rank
% of employees receiving human rights
  policy/procedure training                             1.69
% of investment agreements that include
  human rights clauses                                  2.04
% of company sales that support
  human/environmental health/safety                     2.27


47. a. In Chapter 11, the block sum of squares
was defined by $SSB = I \sum_j (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$.
Replacing X’s with R’s in this expression,
explain why SSB = 0 for the
Friedman test.

b. The total sum of squares for ranks $R_{ij}$
is $SST = \sum_i \sum_j (R_{ij} - \bar{R}_{\cdot\cdot})^2$. Determine
SST for the Friedman test. [Hint: Your
answer should depend only on I and J.]


48. In the context of a randomized block
experiment, let $W_i$ denote the sample rank
sum associated with the ith population/
treatment. Show that Friedman’s test
statistic can be re-expressed as
49. In a study described in the article
“Hyaluronan Impairs Vascular Function 
and Drug Delivery in a Mouse Model of 
Pancreatic Cancer” (Gut 2013: 112-120), 
hyaluronan was depleted from mice using 
either PEGPH20 or an equal dose of a 
standard treatment. The vessel patency (%) 
for each mouse was recorded. 


PEGPH20 62 68 70 76 
Standard 24 29 35 41 
Use a rank-sum test (as did the article’s
authors) to determine if PEGPH20 yields 
higher vessel patency than the standard 
treatment at the .05 level. Would you also 
reject the null hypothesis at the .01 level? 
Comment on these results in light of the fact 
that every PEGPH20 measurement is higher 
than every standard-treatment measurement. 


50. The article “Long Telomeres are Associated
with Clonality in Wild Populations of
C. tenuispina” (Heredity 2015: 437-443)
reported the following telomere measurements
for (1) a normal arm and (2) a
regenerating arm of 12 Mediterranean
starfish.


Normal 11.246 11.493 11.136 11.120 10.928 11.556 
Regen. 11.142 11.047 11.004 11.506 11.067 10.875 


Normal 11.313 11.164 10.878 12.680 11.937 11.172 
Regen. 11.484 11.517 10.756 10.973 11.078 11.182 


It is theorized that such measurements 
should be smaller for regenerating arms, 
because the above values are inversely rela- 
ted to telomere length and longer telomeres 
are associated with younger tissue. Use a 
nonparametric test to see if the data supports 
this theory at the .05 significance level. 


51. Physicians use a variety of quantitative sensory
testing (QST) tools to assess pain in
patients, but there is concern about the con- 
sistency of such tools. The article “Test- 
Retest Reliability of [QST] in Knee 
Osteoarthritis and Healthy Participants” 
(Osteoarthritis and Cartilage 2011: 655- 
658) describes a study in which participants’ 
responses to various stimuli were measured 
and then re-measured one week later. For 
example, pressure was applied to each sub- 
ject’s knee, and the level (kPa) at which the 
patient first experienced pain was recorded. 


a. Pressure pain measurements were taken 
twice on each of 50 patients with 
osteoarthritis in the examined knee. The 
Wilcoxon signed-rank test statistic value 
computed from this paired data was 
$s_+ = 616$. Use a two-sided, large-sample
test to assess the reliability of the sensory 
test at the .10 significance level. 

b. The same measurements were made of 
50 healthy patients, and the resulting 
test statistic value was $s_+ = 814$. Carry
out the test indicated in part (a) using 
this information. Does the pain pressure 
test appear reliable for the population of 
healthy patients? 


52. Adding mileage information to roadside
amenity signs (“Motel 6 in 1.2 miles”) can 
be helpful but might also increase accidents 
as drivers strain to read the detailed infor- 
mation at a distance. The article “Evalua- 
tion of Adding Distance Information to 
Freeway-Specific Service (Logo) Signs” 
(Transp. Engr. 2011: 782-788) provides 
the following information on number of 
crashes per year before and after mileage 
information was added to signs at six 
locations in Virginia. 


Location   1    2    3    4     5    6
Before     15   26   66   115   62   64
After      16   24   42   80    78   73


a. Use a one-sample sign test to determine 
whether more accidents occur after 
mileage information is added to road- 
side amenity signs. Be sure to state the 
hypotheses, and indicate what assump- 
tions are required. 

b. Use a signed-rank test to determine 
whether more accidents tend to occur 
after mileage information is added to 
roadside amenity signs. What are the 
hypotheses now, and what additional 
assumptions are required? 


53. The accompanying observations on axial
stiffness index resulted from a study of 
metal-plate connected trusses in which five 
different plate lengths—4 in., 6 in., 8 in., 10 
in., and 12 in.—were used (“Modeling 
Joints Made with Light-Gauge Metal Con- 
nector Plates,” Forest Products J. 1979: 
39-44). 
i = 1 (4 in.):   309.2  309.7  311.0  316.8  326.5  349.8  409.5
i = 2 (6 in.):   331.0  347.2  348.9  361.0  381.7  402.1  404.5
i = 3 (8 in.):   351.0  357.1  366.2  367.3  382.0  392.4  409.9
i = 4 (10 in.):  346.7  362.6  384.2  410.6  433.1  452.9  461.4
i = 5 (12 in.):  407.4  410.7  419.9  441.2  441.8  465.8  473.4


Use the Kruskal-Wallis test to decide at 
significance level .01 whether the true 
average axial stiffness index depends 
somehow on plate length. 


54. The article “Production of Gaseous Nitrogen
in Human Steady-State Conditions”
(J. Appl. Physiol. 1972: 155-159) reports
the following observations on the amount
of nitrogen expired (in liters) under four
dietary regimens: (1) fasting, (2) 23% protein,
(3) 32% protein, and (4) 67% protein.
Use the Kruskal-Wallis test at level .05 to
test equality of the corresponding $\mu_i$’s.


1   4.079  4.859  3.540  5.047  3.298  4.679  2.870  4.648  3.847
2   4.368  5.668  3.752  5.848  3.802  4.844  3.578  5.393  4.374
3   4.169  5.709  4.416  5.666  4.123  5.059  4.403  4.496  4.688
4   4.928  5.608  4.940  5.291  4.674  5.038  4.905  5.208  4.806


55. The article “Physiological Effects During
Hypnotically Requested Emotions” (Psy- 
chosomatic Med. 1963: 334-343) reports 
the following data (x) on skin potential in 
millivolts when the emotions of fear, hap- 
piness, depression, and calmness were 
requested from each of eight subjects. 


                      Blocks (subjects)

             1     2     3     4     5     6     7     8
Fear         23.1  57.6  10.5  23.6  11.9  54.6  21.0  20.3
Happiness    22.7  53.2   9.7  19.6  13.8  47.1  13.6  23.6
Depression   22.5  33.7  10.8  21.1  13.7  39.2  13.7  16.3
Calmness     22.6  53.1   8.3  21.6  13.3  37.0  14.8  14.8


Use Friedman’s test to decide whether
emotion has an effect on skin potential.


56. In an experiment to study the way in which


different anesthetics affect plasma epinephr- 
ine concentration, ten dogs were selected, 
and concentration was measured while they 
were under the influence of the anesthetics 
isoflurane, halothane, and cyclopropane 
(“Sympathoadrenal and Hemodynamic 
Effects of Isoflurane, Halothane, and Cyclo- 
propane in Dogs,” Anesthesiology 1974: 
465-470). Test at level .05 to see whether 
there is an anesthetic effect on concentration. 


1 2 3 4 5 


Isoflurane 28 Si 
Halothane 30 39 .63 38 21 
Cyclopropane 1.07 1.35 .69 28 


6 7 8 9 10 


Isoflurane 36 32. 69 AT oo 
Halothane 88 39 51 32 42 
Cyclopropane 1.53 49 56 


57. Suppose we wish to test

$H_0$: the X and Y distributions are identical
versus
$H_a$: the X distribution is less spread out
than the Y distribution


The accompanying figure pictures X and
Y distributions for which $H_a$ is true. The
Wilcoxon rank-sum test is not appropriate
in this situation because when $H_a$ is true as
pictured, the Y’s will tend to be at the
extreme ends of the combined sample
(resulting in small and large Y ranks), so the
sum of X ranks will result in a W value that
is neither large nor small.


[Figure: density curves of the X distribution and the more spread-out Y distribution; modified “ranks” shown along the axis: 1 3 5 ... 6 4 2]


Consider modifying the procedure for
assigning ranks as follows: After the combined
sample of m + n observations is
ordered, the smallest observation is given
rank 1, the largest observation is given rank
2, the second smallest is given rank 3, the
second largest is given rank 4, and so on.
Then if $H_a$ is true as pictured, the X values
will tend to be in the middle of the sample
and thus receive large ranks. Let W′ denote
the sum of the X ranks and consider
rejecting $H_0$ in favor of $H_a$ when w′ > c.
When $H_0$ is true, every possible set of
X ranks has the same probability, so W′ has
the same distribution as does W when $H_0$ is
true. Thus c can be chosen from Appendix
Table A.13 to yield a level α test. The
accompanying data refers to medial muscle
thickness for arterioles from the lungs of
children who died from sudden infant death
syndrome (x’s) and a control group of
children (y’s). Carry out the test of $H_0$
versus $H_a$ at level .05.


SIDS 4.0 4.4 4.8 4.9 
Control 3.7 4.1 4.3 5.1 5.6 


[Note: Consult the Lehmann book in the 
bibliography for more information on this 
test, called the Siegel-Tukey test.] 


58. The ranking procedure described in the
previous exercise is somewhat asymmetric,
because the smallest observation receives
rank 1 whereas the largest receives rank 2,
and so on. Suppose both the smallest and
the largest receive rank 1, the second
smallest and second largest receive rank 2,
and so on, and let W″ be the sum of the
X ranks. The null distribution of W″ is not
identical to the null distribution of W, so
different tables are needed. Consider the 
case m=3, n=4. List all 35 possible 
orderings of the three X values among the 
seven observations (e.g., 1, 3, 7 or 4, 5, 6), 
assign ranks in the manner described, 
compute the value of W” for each possi- 
bility, and then tabulate the null distribution 
of W”. For the test that rejects if w” > c, 
what value of c prescribes approximately a 
level .10 test? [Note: This is the Ansari- 
Bradley test; for additional information, see 
the book by Hollander and Wolfe in the 
bibliography. ] 



Introduction 

In this final chapter, we briefly introduce the Bayesian approach to parameter estimation. The standard
frequentist view of inference is that the parameter of interest, θ, has a fixed but unknown value.
Bayesians, however, regard θ as a random variable having a prior probability distribution that
incorporates whatever is known about its value. Then to learn more about θ, a sample from the
conditional distribution f(x|θ) is obtained, and Bayes’ theorem is used to produce the posterior
distribution of θ given the data $x_1, \ldots, x_n$. All Bayesian methods are based on this posterior distribution.


15.1 Prior and Posterior Distributions


Throughout this book, we have regarded parameters such as μ, σ, p, and λ as having an unknown but
single, fixed value. This is often referred to as the classical or frequentist approach to statistical
inference. However, there is a different paradigm, called subjective or Bayesian inference, in which an
unknown parameter is assigned a distribution of possible values, analogous to a probability
distribution. This distribution reflects all available information (past experience, intuition, common sense)
about the value of the parameter prior to observing the data. For this reason, it is called the prior
distribution of the parameter.


DEFINITION A prior distribution for a parameter θ, denoted π(θ), is a probability distribution on
the set of possible values for θ. In particular, if the possible values of the parameter
θ form an interval J, then π(θ) is a pdf that must satisfy

$$\int_J \pi(\theta)\, d\theta = 1$$

Similarly, if θ is potentially any value in a discrete set D, then π(θ) is a pmf that
must satisfy

$$\sum_{\theta \in D} \pi(\theta) = 1$$
Example 15.1 Consider the parameter μ = the mean GPA of all students at your university. Since
GPAs are always between 0.0 and 4.0, μ must also lie in this interval. But common sense tells you
that μ is almost certainly not below 2.0, or very few people would graduate, and it would be likewise
surprising to find μ above 3.5. This “prior belief” can be expressed mathematically as a prior
distribution for μ on the interval J = [0, 4]. If our best guess a priori is that μ ≈ 2.5, then our prior
distribution π(μ) should be centered around 2.5. The variability of the prior distribution we select
should reflect how sure we feel about our initial information.

If we feel very sure that μ is near 2.5, then we should select a prior distribution for μ that has less
variation around that value. On the other hand, if we are less certain, this can be reflected by a prior
distribution with much greater variability. Figure 15.1 illustrates these two cases; both of the pdfs
depicted are beta distributions with A = 0 and B = 4.

Figure 15.1 Two prior distributions for a parameter: a more diffuse prior (less certainty)
and a more concentrated prior (more certainty) ■
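
As a rough numerical companion to Example 15.1, the sketch below compares two beta priors on [0, 4] that are both centered at 2.5. The shape parameters (5, 3) and (50, 30) are assumptions chosen purely for illustration; the text does not say which shapes were used in Figure 15.1.

```python
import math

A, B = 0.0, 4.0  # endpoints of the GPA scale, as in Example 15.1


def scaled_beta_mean_sd(alpha, beta, lower=A, upper=B):
    """Mean and standard deviation of a beta distribution rescaled to [lower, upper]."""
    width = upper - lower
    mean = lower + width * alpha / (alpha + beta)
    var = width ** 2 * alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(var)


# Hypothetical shape parameters (not from the text); both give a prior mean of 2.5.
diffuse_mean, diffuse_sd = scaled_beta_mean_sd(5, 3)
concentrated_mean, concentrated_sd = scaled_beta_mean_sd(50, 30)

print(f"diffuse prior:      mean = {diffuse_mean:.2f}, sd = {diffuse_sd:.3f}")
print(f"concentrated prior: mean = {concentrated_mean:.2f}, sd = {concentrated_sd:.3f}")
```

Both priors express the same best guess; only the standard deviation, and hence the stated strength of belief, differs.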


The Posterior Distribution of a Parameter 


The key to Bayesian inference is having a mathematically rigorous way to combine the sample data
with prior belief. Suppose we observe values $x_1, \ldots, x_n$ from a distribution depending on the unknown
parameter θ for which we have selected some prior distribution. Then a Bayesian statistician wants to
“update” her or his belief about the distribution of θ, taking into account both prior belief and the
observed $x_i$’s. Recall from Chapter 2 that Bayes’ theorem was used to obtain posterior probabilities of
partitioning events $A_1, \ldots, A_k$ conditional on the occurrence of some other event B. The following
definition relies on the analogous result for random variables.


DEFINITION Suppose $X_1, \ldots, X_n$ have joint pdf $f(x_1, \ldots, x_n; \theta)$ and the unknown parameter θ has
been assigned a continuous prior distribution π(θ). Then the posterior distribution
of θ, given the observations $X_1 = x_1, \ldots, X_n = x_n$, is

$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{\pi(\theta)\, f(x_1, \ldots, x_n; \theta)}{\int \pi(\theta)\, f(x_1, \ldots, x_n; \theta)\, d\theta} \tag{15.1}$$

The integral in the denominator of (15.1) ensures that the posterior distribution
is a valid probability density with respect to θ.
If $X_1, \ldots, X_n$ are discrete, the joint pdf is replaced by their joint pmf.


Notice that constructing the posterior distribution of a parameter requires a specific probability model
$f(x_1, \ldots, x_n; \theta)$ for the observed data. In Example 15.1, it would not be enough to simply observe the
GPAs of a random sample of n students; one must specify the underlying distribution, with mean μ,
from which those GPAs are drawn.


Example 15.2 Emissions of subatomic particles from a radiation source are often modeled as a
Poisson process. This implies that the time between successive emissions follows an exponential
distribution. In practice, the parameter λ of this distribution is typically unknown. If researchers
believe a priori that the average time between emissions is about half a second, so λ = 2, a prior
distribution with a mean around 2 might be selected for λ. One example is the following gamma
distribution, which has mean (and variance) of 2:

$$\pi(\lambda) = \lambda e^{-\lambda}, \quad \lambda > 0$$

Notice that the gamma distribution has support equal to (0, ∞), which is also the set of possible values
for the unknown parameter λ.

The times $X_1, \ldots, X_5$ between five particle emissions will be recorded; it is these variables that have
an exponential distribution with the unknown parameter λ (equivalently, mean 1/λ). Because the $X_i$’s
are also independent, their joint pdf is

$$f(x_1, \ldots, x_5; \lambda) = f(x_1; \lambda) \cdots f(x_5; \lambda) = \lambda e^{-\lambda x_1} \cdots \lambda e^{-\lambda x_5} = \lambda^5 e^{-\lambda \sum x_i}$$

Applying (15.1) with these two components, the posterior distribution of λ given the observed data is


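$$\pi(\lambda \mid x_1, \ldots, x_5) = \frac{\pi(\lambda)\, f(x_1, \ldots, x_5; \lambda)}{\int_0^\infty \pi(\lambda)\, f(x_1, \ldots, x_5; \lambda)\, d\lambda} = \frac{\lambda e^{-\lambda} \cdot \lambda^5 e^{-\lambda \sum x_i}}{\int_0^\infty \lambda e^{-\lambda} \cdot \lambda^5 e^{-\lambda \sum x_i}\, d\lambda} = \frac{\lambda^6 e^{-\lambda(1 + \sum x_i)}}{\int_0^\infty \lambda^6 e^{-\lambda(1 + \sum x_i)}\, d\lambda}$$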


Suppose the five observed interemission times are $x_1 = 0.66$, $x_2 = 0.48$, $x_3 = 0.44$, $x_4 = 0.71$,
$x_5 = 0.56$. The sum of these five times is $\sum x_i = 2.85$, and so the posterior distribution simplifies to

$$\pi(\lambda \mid 0.66, \ldots, 0.56) = \frac{\lambda^6 e^{-3.85\lambda}}{\int_0^\infty \lambda^6 e^{-3.85\lambda}\, d\lambda} = \frac{3.85^7}{6!}\, \lambda^6 e^{-3.85\lambda}, \quad \lambda > 0$$

The integral in the denominator was evaluated using the gamma integral formula (4.5) from Chapter 4;
as noted previously, the purpose of this integral is to guarantee that the posterior distribution of λ is a
valid probability density. As a function of λ, we recognize this as a gamma distribution with
parameters α = 7 and β = 1/3.85. The prior and posterior density curves of λ appear in Figure 15.2.

Figure 15.2 Prior and posterior distributions of λ for Example 15.2 ■
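
The calculation in Example 15.2 can be checked numerically. The sketch below normalizes prior times likelihood with a simple midpoint rule and compares the result to the Gamma(7, 1/3.85) density stated in the text. The helper names, the integration range [0, 40], and the step count are choices made for this illustration only.

```python
import math

times = [0.66, 0.48, 0.44, 0.71, 0.56]  # observed interemission times from the example


def gamma_pdf(lam, alpha, beta):
    """Gamma pdf with shape alpha and scale beta (the text's parameterization)."""
    return lam ** (alpha - 1) * math.exp(-lam / beta) / (math.gamma(alpha) * beta ** alpha)


def unnormalized_posterior(lam):
    """Gamma(2, 1) prior times the joint exponential likelihood of the data."""
    likelihood = lam ** len(times) * math.exp(-lam * sum(times))
    return gamma_pdf(lam, 2.0, 1.0) * likelihood


def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# Denominator of (15.1); the integrand is negligible beyond lambda = 40.
norm_const = midpoint_integral(unnormalized_posterior, 0.0, 40.0)

alpha1, beta1 = 7.0, 1 / 3.85  # posterior parameters stated in the text

for lam in (0.5, 1.0, 2.0, 3.0):
    numeric = unnormalized_posterior(lam) / norm_const
    closed_form = gamma_pdf(lam, alpha1, beta1)
    print(f"lambda = {lam}: numeric = {numeric:.6f}, Gamma(7, 1/3.85) = {closed_form:.6f}")
```

The two columns should agree to several decimal places, which is the numerical counterpart of recognizing the posterior as a gamma density.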


Example 15.3 A 2010 National Science Foundation study found that 488 out of 939 surveyed adults
incorrectly believe that antibiotics kill viruses (they only kill bacteria). Let θ denote the proportion of
all U.S. adults who hold this mistaken view. Imagine that an NSF researcher, in advance of
administering the survey, believed (hoped?) the value of θ was roughly 1 in 3, but he was very
uncertain about this belief. Since any proportion must lie between 0 and 1, the standard beta family of
distributions from Section 4.5 provides a natural source of priors for θ. One such beta distribution,
with an expected value of 1/3, is the Beta(2, 4) model whose pdf is

$$\pi(\theta) = 20\,\theta(1-\theta)^3, \quad 0 < \theta < 1$$

The data mentioned at the beginning of the example can be considered either a random sample of size
939 from the Bernoulli distribution or, equivalently, a single observation from the binomial
distribution with n = 939. Let Y = the number of U.S. adults in a random sample of 939 that believe
antibiotics kill viruses. Then Y ~ Bin(939, θ), and the pmf of Y is $p(y; \theta) = \binom{939}{y}\theta^y(1-\theta)^{939-y}$.


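Substituting the observed value y = 488, (15.1) gives the posterior distribution of θ as

$$\begin{aligned}
\pi(\theta \mid Y = 488) &= \frac{\pi(\theta)\, p(488; \theta)}{\int_0^1 \pi(\theta)\, p(488; \theta)\, d\theta} = \frac{20\,\theta(1-\theta)^3 \cdot \binom{939}{488}\theta^{488}(1-\theta)^{451}}{\int_0^1 20\,\theta(1-\theta)^3 \cdot \binom{939}{488}\theta^{488}(1-\theta)^{451}\, d\theta} \\
&= \frac{\theta^{489}(1-\theta)^{454}}{\int_0^1 \theta^{489}(1-\theta)^{454}\, d\theta} = c \cdot \theta^{489}(1-\theta)^{454}, \quad 0 < \theta < 1
\end{aligned}$$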


Recall that the constant c, which equals the reciprocal of the integral in the denominator, serves to
ensure that the posterior distribution π(θ | Y = 488) integrates to 1. Rather than evaluating the integral,
we can simply recognize the expression $\theta^{489}(1-\theta)^{454}$ as a standard beta distribution, specifically
with parameters α = 490 and β = 455, that’s just missing the constant of integration in front.
It follows that the posterior distribution of θ given Y = 488 must be Beta(490, 455); if we require c, it
can be determined directly from the beta pdf.

This trick comes in handy quite often in Bayesian statistics: if we can recognize a posterior 
distribution as being proportional to a particular probability distribution, then it must necessarily be 
that distribution. 

The prior and posterior density curves for θ are displayed in Figure 15.3. While the prior
distribution is centered around 1/3 and exhibits a great deal of uncertainty (variability), the posterior
distribution of θ is centered much closer to the sample proportion of incorrect answers,
488/939 = .52, with considerably less uncertainty.

Figure 15.3 Density curves for the parameter θ in Example 15.3: (a) prior Beta(2, 4), (b) posterior Beta(490, 455) ■
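
To put numbers on the comparison in Example 15.3, the following sketch computes the mean and standard deviation of the Beta(2, 4) prior and the Beta(490, 455) posterior using the standard beta moment formulas. The function name is an illustrative choice, not something defined in the text.

```python
import math


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a standard Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(var)


prior_mean, prior_sd = beta_mean_sd(2, 4)
post_mean, post_sd = beta_mean_sd(490, 455)
sample_proportion = 488 / 939

print(f"prior:     mean = {prior_mean:.4f}, sd = {prior_sd:.4f}")
print(f"posterior: mean = {post_mean:.4f}, sd = {post_sd:.4f}")
print(f"sample proportion = {sample_proportion:.4f}")
```

The posterior mean lands very near the sample proportion, and the posterior standard deviation is more than ten times smaller than the prior's, consistent with the description of Figure 15.3.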


Conjugate Priors 

In the examples of this section, prior distributions were chosen partially by matching the mean of a 
distribution to someone’s a priori “best guess” about the value of the parameter. We also mentioned at 
the beginning of the section that the variance of the prior distribution often reflects the strength of that 
belief. In practice, there is a third consideration for choosing a prior distribution: the ability to apply 
(15.1) in a simple fashion. Ideally, we would like to choose a prior distribution from a family 
(gamma, beta, etc.) such that the posterior distribution is from that same family. When this happens 
we say that the prior distribution is conjugate to the data distribution. 


In Example 15.2, the prior π(λ) is the Gamma(2, 1) pdf; we determined, using (15.1), that the
posterior distribution was Gamma(7, 1/3.85). It can be shown in general (Exercise 6) that any gamma
distribution is conjugate to an exponential data distribution. Similarly, the prior and posterior
distributions of θ in Example 15.3 were Beta(2, 4) and Beta(490, 455), respectively. The following
proposition generalizes the result of Example 15.3.


PROPOSITION Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli distribution with unknown
parameter value p. (Equivalently, let $Y = \sum X_i$ be a single observation from a
Bin(n, p) distribution.) If p is assigned a beta prior distribution with parameters $\alpha_0$
and $\beta_0$, then the posterior distribution of p given the $x_i$’s is the beta distribution
with parameters $\alpha = \alpha_0 + y$ and $\beta = \beta_0 + n - y$.
That is, the beta distribution is a conjugate prior to the Bernoulli (or binomial)
data model.
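Proof The joint Bernoulli pmf of $X_1, \ldots, X_n$ is

$$p(x_1, \ldots, x_n; p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i} = p^{y}(1-p)^{n-y}$$

The prior distribution assigned to p is $\pi(p) \propto p^{\alpha_0 - 1}(1-p)^{\beta_0 - 1}$. Apply (15.1):

$$\pi(p \mid x_1, \ldots, x_n) \propto p^{\alpha_0 - 1}(1-p)^{\beta_0 - 1} \cdot p^{y}(1-p)^{n-y} = p^{\alpha_0 - 1 + y}(1-p)^{\beta_0 - 1 + n - y} = p^{\alpha - 1}(1-p)^{\beta - 1}$$

As a function of p, we recognize this last expression as proportional to the beta pdf with parameters
$\alpha = \alpha_0 + y$ and $\beta = \beta_0 + n - y$. ■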
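
The update rule in this proposition is simple enough to state as a one-line function. The sketch below applies it to the data of Example 15.3 and also illustrates that updating in two batches gives the same posterior as a single update. The function name and the particular split of the 939 responses into batches are hypothetical.

```python
def beta_binomial_update(alpha0, beta0, y, n):
    """Posterior beta parameters after observing y successes in n Bernoulli trials."""
    if not 0 <= y <= n:
        raise ValueError("y must satisfy 0 <= y <= n")
    return alpha0 + y, beta0 + n - y


# Example 15.3: Beta(2, 4) prior, 488 successes in 939 trials.
posterior = beta_binomial_update(2, 4, y=488, n=939)
print(posterior)

# Hypothetical split of the same data into two batches (200 of 400, then 288 of 539).
after_first_batch = beta_binomial_update(2, 4, y=200, n=400)
after_second_batch = beta_binomial_update(*after_first_batch, y=288, n=539)
print(after_second_batch)
```

Because the posterior stays in the beta family, yesterday's posterior can serve directly as today's prior, which is one practical appeal of conjugate priors.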


The values $\alpha_0$ and $\beta_0$ in the foregoing proposition are called hyperparameters; they are the
parameters of the distribution assigned to the original parameter, p. In Example 15.2, the prior
distribution $\pi(\lambda) = \lambda e^{-\lambda}$ is the Gamma(2, 1) pdf, so the hyperparameters of that distribution are
$\alpha_0 = 2$ and $\beta_0 = 1$.

Conjugate priors have been determined for several of the named data distributions, including 
binomial, Poisson, gamma, and normal. For two-parameter families such as gamma and normal, it is 
sometimes reasonable to assume one parameter has a known value and then assign a prior distribution 
to the other. In some instances, a joint prior distribution for the two parameters can be found such that 
the posterior distribution is tractable, but these are less common. 


PROPOSITION Let $X_1, \ldots, X_n$ be a random sample from a N(μ, σ) distribution with σ known. If μ
is assigned a normal prior distribution with hyperparameters $\mu_0$ and $\sigma_0$, then the
posterior distribution of μ is also normal, with posterior hyperparameters

$$\mu_1 = \frac{n\bar{x}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2} \qquad \sigma_1^2 = \frac{1}{n/\sigma^2 + 1/\sigma_0^2}$$

That is, the normal distribution is a conjugate prior for μ in the normal data model
when σ is assumed known.


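Proof To determine the posterior distribution of μ, apply (15.1):

$$\begin{aligned}
\pi(\mu \mid x_1, \ldots, x_n) &\propto \pi(\mu)\, f(x_1, \ldots, x_n; \mu) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-(\mu - \mu_0)^2/(2\sigma_0^2)} \times \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_1 - \mu)^2/(2\sigma^2)} \times \cdots \times \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_n - \mu)^2/(2\sigma^2)} \\
&\propto e^{-(1/2)\left[(x_1 - \mu)^2/\sigma^2 + \cdots + (x_n - \mu)^2/\sigma^2 + (\mu - \mu_0)^2/\sigma_0^2\right]}
\end{aligned}$$

The trick here is to complete the square in the exponent, which yields

$$\pi(\mu \mid x_1, \ldots, x_n) \propto e^{-(\mu - \mu_1)^2/(2\sigma_1^2) + C}$$

where C does not involve μ and

$$\mu_1 = \frac{\sum x_i/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2} = \frac{n\bar{x}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2} \qquad \sigma_1^2 = \frac{1}{n/\sigma^2 + 1/\sigma_0^2}$$

With respect to μ, the last expression for $\pi(\mu \mid x_1, \ldots, x_n)$ is proportional to the normal pdf with
parameters $\mu_1$ and $\sigma_1$. ■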


To make sense of these messy parameter expressions, we define the precision, denoted by τ, as the
reciprocal of the variance (because a lower variance implies a more precise measurement); the
weights in the expressions above are then the corresponding precisions. If we let $\tau = 1/\sigma^2$, $\tau_0 = 1/\sigma_0^2$, and
$\tau_{\bar{x}} = 1/\sigma_{\bar{x}}^2 = n/\sigma^2$, then the posterior hyperparameters in the previous proposition can be restated as

$$\mu_1 = \frac{\tau_{\bar{x}} \cdot \bar{x} + \tau_0 \cdot \mu_0}{\tau_{\bar{x}} + \tau_0} \quad \text{and} \quad \tau_1 = \tau_{\bar{x}} + \tau_0$$

The posterior mean $\mu_1$ is a weighted average of the prior mean $\mu_0$ and the data mean $\bar{x}$, and the
posterior precision is the sum of the prior precision plus the precision of the sample mean.
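
The precision form of the update translates directly into code. The sketch below is one way to write it; the function name, the prior N(2.5, 0.5), the known σ = 0.6, and the eight GPA values are all made up for illustration and do not come from the text.

```python
import math


def normal_posterior(mu0, sigma0, sigma, xs):
    """Posterior (mean, sd) of mu for a N(mu0, sigma0) prior and N(mu, sigma) data, sigma known."""
    n = len(xs)
    xbar = sum(xs) / n
    tau0 = 1 / sigma0 ** 2        # prior precision
    tau_xbar = n / sigma ** 2     # precision of the sample mean
    tau1 = tau0 + tau_xbar        # posterior precision is the sum of the two
    mu1 = (tau_xbar * xbar + tau0 * mu0) / tau1  # precision-weighted average
    return mu1, math.sqrt(1 / tau1)


# Hypothetical inputs: prior N(2.5, 0.5) for a mean GPA, known sigma = 0.6, made-up sample.
gpas = [3.1, 2.8, 3.4, 2.9, 3.3, 3.0, 2.7, 3.2]
mu1, sigma1 = normal_posterior(2.5, 0.5, 0.6, gpas)
print(f"posterior mean = {mu1:.4f}, posterior sd = {sigma1:.4f}")
```

With these inputs the posterior mean falls between the prior mean and the sample mean, closer to the sample mean because the data carry more precision than the prior.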


Exercises: Section 15.1 (1-10) 


1. A certain type of novelty coin is manufactured
so that 80% of the coins are fair while
the rest have a .75 chance of landing heads.
Let θ denote the probability of heads for a
novelty coin randomly selected from this
population.


a. Express the given information as a prior
distribution for the parameter θ.

b. Five tosses of the randomly selected
coin result in the sequence HHHTH.
Use this data to determine the posterior
distribution of θ.


2. Three assembly lines for the same product 
have different nonconformance rates: p = .1 
for Line A, p = .15 for Line B, and p = .2 
for Line C. One of the three lines will be 
selected at random (but you don’t know 
which). Let X = the number of items 
inspected from the selected line until a 
nonconforming one is found. 


a. What is the distribution of X, as a 
function of the unknown p? 


b. Express the given information as a prior 
distribution for the parameter p. [Hint: 
There are three possible values for p. 
What should be their a_ priori 
likelihoods?] 

c. It is determined that the 8th item coming 
off the randomly selected line is the first 
nonconforming one. Use this informa- 
tion to determine the posterior distribu- 
tion of p. 


3. The number of customers arriving during a
one-hour period at an ice cream shop is
modeled by a Poisson distribution with
unknown parameter μ. Based on past
experience, the owner believes that the
average number of customers in one hour is
about 15.


a. Assign a prior to μ from the gamma
family of distributions, such that the mean
of the prior is 15 and the standard deviation
is 5 (reflecting moderate uncertainty).

b. The number of customers in ten randomly
selected one-hour intervals is
recorded:

ie & Vin Ae dee Ge te

Determine the posterior distribution of μ.


4. At a children’s party, kids toss ping-pong
balls into the swimming pool hoping to
land inside a plastic ring (picture a small
hula hoop). Let p denote the probability of
successfully tossing a ball into the ring,
which we will assign a Beta(1, 3) prior
distribution. The variable X = number of
tosses required to get 5 balls in the ring (at
which point the child wins a prize) will be
observed for a sample of children.


a. What is a reasonable distribution for the
rv X? What are its parameters?

b. The number of tosses required by eight
kids was

12 8 14 12 12 6 8 27

Determine the posterior distribution of p.


5. Consider a random sample $X_1, X_2, \ldots, X_n$
from the Poisson distribution with mean μ.
If the prior distribution for μ is a gamma
distribution with hyperparameters $\alpha_0$ and
$\beta_0$, show that the posterior distribution is
also gamma distributed. What are its
hyperparameters?


6. Suppose you have a random sample $X_1, X_2,
\ldots, X_n$ from the exponential distribution
with parameter λ. If a gamma distribution
with hyperparameters $\alpha_0$ and $\beta_0$ is assigned
as the prior distribution for λ, show that the
posterior distribution is also gamma
distributed. What are its hyperparameters?


7. Let $X_1, \ldots, X_n$ be a random sample from
a negative binomial distribution with
r known and p unknown. Assume a
Beta($\alpha_0$, $\beta_0$) prior distribution for p. Show that
the posterior distribution of p is also a beta
distribution, and identify the updated
hyperparameters.


8. Consider a random sample $X_1, X_2, \ldots, X_n$
from the normal distribution with mean 0
and precision τ (use τ as a parameter
instead of $\sigma^2 = 1/\tau$). Assume a
gamma-distributed prior for τ and show that the
posterior distribution of τ is also gamma.
What are its parameters?


9. Wind speeds in a certain area are modeled
using a lognormal distribution, with
unknown first parameter μ and known
second parameter σ = 1. Suppose μ is
assigned a normal prior distribution with
mean $\mu_0$ and precision $\tau_0$. Based on
observing a random sample of wind speeds
$x_1, \ldots, x_n$ in this area, determine the
posterior distribution of μ.


10. Wait times for Uber rides as people exit a
certain sports arena are uniformly
distributed on the interval [0, θ] with θ
unknown. Suppose the following Pareto
prior distribution is assigned to θ:

$$\pi(\theta) = \frac{24{,}000}{\theta^4}, \quad \theta > 20$$

Based on observing the wait times
$x_1, \ldots, x_n$ of n Uber customers, determine
the posterior distribution of θ. [Hint: Some
care must be taken to address the boundaries
θ > 20 and $x_i$ < θ.]


15.2 Bayesian Point and Interval Estimation 


The previous section introduced the paradigm of Bayesian inference, wherein parameters are not just
regarded as unknown but as having a distribution of possible values prior to observing any data. Such
prior distributions are, by definition, valid probability distributions with respect to the parameter $\theta$. The
key to Bayesian inference is Equation (15.1), which applies Bayes’ theorem for random variables to $\theta$ and
the sample $X_1, \ldots, X_n$. The result is an update to our belief about $\theta$, called the posterior distribution.

From a Bayesian perspective, the posterior distribution of $\theta$ represents the most complete
expression of what can be inferred from the sample data. But the posterior distribution can give rise to
point and interval estimates for the parameter $\theta$, although the interpretation of the latter differs from
that of our earlier confidence intervals.


Bayesian Point Estimators 


Although specifying a single-value estimate of an unknown parameter conflicts somewhat with the
Bayesian philosophy, there are occasions when such an estimate is desired. The most common
Bayesian point estimator of a parameter $\theta$ is the mean of its posterior distribution:


$$\hat{\theta} = E(\theta \mid X_1, \ldots, X_n) \tag{15.2}$$


Hereafter we shall refer to Expression (15.2) as the Bayes estimator of $\theta$. A Bayes estimate is
obtained by plugging in the observed values of the $X_i$’s, resulting in the numerical value


$$\hat{\theta} = E(\theta \mid x_1, \ldots, x_n).$$


Example 15.4 (Example 15.2 continued) The posterior distribution of the exponential parameter $\lambda$
given the five observed interemission times was determined to be gamma with parameters $\alpha = 7$ and
$\beta = 1/3.85$. Since the mean of a gamma distribution is $\alpha\beta$, the Bayes estimate of $\lambda$ here is


$$\hat{\lambda} = E(\lambda \mid 0.66, \ldots, 0.56) = \alpha\beta = 7(1/3.85) = 1.82$$


This isn’t too different from the researchers’ prior belief that $\lambda \approx 2$.

If we retrace the steps that led to this posterior distribution, we find more generally that for data
model $X_1, \ldots, X_n \sim$ exponential($\lambda$) and prior $\lambda \sim$ gamma($\alpha_0$, $\beta_0$), the posterior distribution of $\lambda$ is
the gamma pdf with $\alpha = \alpha_0 + n$ and $1/\beta = 1/\beta_0 + \sum X_i$. Therefore, the Bayes estimator for $\lambda$ in this
scenario is


$$\hat{\lambda} = E(\lambda \mid X_1, \ldots, X_n) = \alpha\beta = \frac{\alpha_0 + n}{1/\beta_0 + \sum X_i}$$
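The update rule just stated is easy to check numerically. The following is a minimal Python sketch, not part of the original text: the function names are illustrative, and the second data set and its gamma(2, 1) prior are hypothetical values chosen only to exercise the formula. Here $\beta$ is treated as a scale parameter, so the gamma mean is $\alpha\beta$ as in the example.

```python
def gamma_posterior(alpha0, beta0, xs):
    """Posterior (alpha, beta) for an exponential rate under a gamma(alpha0, beta0) prior.

    beta is a scale parameter: alpha = alpha0 + n and 1/beta = 1/beta0 + sum(x_i).
    """
    alpha = alpha0 + len(xs)
    beta = 1.0 / (1.0 / beta0 + sum(xs))
    return alpha, beta


def bayes_estimate(alpha, beta):
    """Posterior mean alpha * beta of a gamma(alpha, beta) distribution."""
    return alpha * beta


# Posterior quoted in Example 15.4: alpha = 7, beta = 1/3.85.
print(round(bayes_estimate(7, 1 / 3.85), 2))  # 1.82

# Hypothetical interemission times and a hypothetical gamma(2, 1) prior.
times = [0.4, 0.9, 0.3, 0.7, 0.6]
alpha, beta = gamma_posterior(2, 1.0, times)
print(alpha, round(bayes_estimate(alpha, beta), 3))
```

Working with $1/\beta$ rather than $\beta$ keeps the update additive, which mirrors the form of the posterior hyperparameters given above.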


Although the mean of the posterior distribution is commonly used as a point estimate for a
parameter in Bayesian inference, that is not the only available choice. Some practitioners prefer to use
the mode of the posterior distribution rather than the mean; this choice is called the maximum a
posteriori (MAP) estimate of $\theta$. For small samples, the Bayes estimate (i.e., mean) and MAP
estimate can differ considerably. Typically, though, when n is large these estimates for the parameter
will be reasonably close. This makes sense intuitively, since as n increases any sensible estimator
should converge to the true, single value of the parameter (i.e., be consistent).
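To see how the gap between the posterior mean and the MAP estimate shrinks, here is a small Python sketch for beta posteriors. It relies on two standard facts about the Beta($\alpha$, $\beta$) distribution: its mean is $\alpha/(\alpha + \beta)$ and, when $\alpha > 1$ and $\beta > 1$, its mode is $(\alpha - 1)/(\alpha + \beta - 2)$. The Beta(3, 2) posterior is a hypothetical small-sample case; Beta(490, 455) is the posterior from Example 15.3.

```python
from fractions import Fraction


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)


def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution; the formula needs a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return Fraction(a - 1, a + b - 2)


# Hypothetical small-sample posterior, then the large-sample posterior.
for a, b in [(3, 2), (490, 455)]:
    mean, mode = beta_mean(a, b), beta_mode(a, b)
    print(f"Beta({a}, {b}): mean = {float(mean):.4f}, MAP = {float(mode):.4f}")
# Beta(3, 2): mean = 0.6000, MAP = 0.6667
# Beta(490, 455): mean = 0.5185, MAP = 0.5186
```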


Example 15.5 (Example 15.3 continued) The posterior distribution of the parameter $\theta$ = the proportion
of all U.S. adults that incorrectly believe antibiotics kill viruses was determined to be a Beta(490, 455)
distribution. Since the mean of a Beta($\alpha$, $\beta$) distribution is $\alpha/(\alpha + \beta)$, a point estimate of $\theta$ is


$$\hat{\theta} = E(\theta \mid y = 488) = \frac{490}{490 + 455} = \frac{490}{945} = .5185$$


It can be shown that the mode of a beta distribution occurs at $(\alpha - 1)/(\alpha + \beta - 2)$ provided that
$\alpha > 1$ and $\beta > 1$. Hence the MAP estimate of $\theta$ here is (490 − 1)/(490 + 455 − 2) = 489/943 = .5186.
Notice these are both quite close to the frequentist estimate y/n = 488/939 = .5197.


Properties of Bayes Estimators


In most cases, the contribution of the observed values $x_1, \ldots, x_n$ in shaping the posterior distribution
of a parameter $\theta$ increases as the sample size n increases. Equivalently, the choice of prior distribution
is less impactful for large samples, because the data “dominates” that original choice. It can be shown
that under very general conditions, as $n \to \infty$


(1) the mean of the posterior distribution will converge to the true value of $\theta$, and
(2) the variance of the posterior distribution of $\theta$ converges to zero.


The second property manifests itself in our two previous examples: the variability of the posterior
distribution of $\lambda$ based on n = 5 observations was still rather substantial, while the posterior
distribution of $\theta$ based on a sample of size n = 939 was quite concentrated. In the language of Chapter 7,
these two properties imply that Bayes estimators are generally consistent.
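The contrast between the two posteriors can be quantified by their standard deviations. The sketch below is an illustration rather than part of the original text; it uses the standard formulas $\sqrt{\alpha}\,\beta$ for a gamma distribution with shape $\alpha$ and scale $\beta$, and $\sqrt{\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]}$ for a Beta($\alpha$, $\beta$) distribution.

```python
import math


def gamma_sd(alpha, beta):
    """Standard deviation of a gamma distribution with shape alpha and scale beta."""
    return math.sqrt(alpha) * beta


def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Posterior of lambda after n = 5 observations: gamma(7, 1/3.85).
print(round(gamma_sd(7, 1 / 3.85), 3))
# Posterior of theta after n = 939 observations: Beta(490, 455).
print(round(beta_sd(490, 455), 4))
```

Relative to the posterior means (about 1.82 and .52), the first standard deviation is large and the second is small, which is the behavior property (2) describes.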


Since traditional estimators such as $\hat{P}$ and $\bar{X}$ converge to the true values of corresponding
parameters (e.g., p or $\mu$) by the Law of Large Numbers, it follows that Bayesian and frequentist
estimates will typically be quite close when n is large. This is true both for the point estimates and the
interval estimates (Bayesian intervals will be introduced shortly). But when n is small (a common
occurrence in Bayesian methodology), parameter estimates based on the two methods can differ
drastically. This is especially true if the researcher’s prior belief is very far from what’s actually true
(e.g., believing a proportion is around 1/3 when it’s really greater than .5).


Example 15.6 Consider a Bernoulli(p) random sample $X_1, \ldots, X_n$ or, equivalently, a single binomial
observation $Y = \sum X_i$. If we assign a Beta($\alpha_0$, $\beta_0$) prior to p, a proposition from the previous section
establishes that the posterior distribution of p given that Y = y is Beta($\alpha_0 + y$, $\beta_0 + n - y$). Hence the
Bayes estimator of p is


$$\hat{p} = E(p \mid Y) = \frac{\alpha_0 + Y}{(\alpha_0 + Y) + (\beta_0 + n - Y)} = \frac{\alpha_0 + Y}{\alpha_0 + \beta_0 + n}$$


One way to think about the prior distribution here is that it “seeds” the sample with $\alpha_0$ successes and
$\beta_0$ failures before data is obtained. The quantities $\alpha_0 + Y$ and $\beta_0 + n - Y$ then represent the number of
successes and failures after sampling, and the Bayes estimator $\hat{p}$ is the sample proportion of successes
from this perspective.

With a little algebra, we can re-express the Bayes estimator as


$$\hat{p} = \frac{\alpha_0}{\alpha_0 + \beta_0 + n} + \frac{Y}{\alpha_0 + \beta_0 + n} = \frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + n} \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} + \frac{n}{\alpha_0 + \beta_0 + n} \cdot \frac{Y}{n} \tag{15.3}$$


Expression (15.3) represents the Bayesian estimator $\hat{p}$ as a weighted average of the prior
expectation of p, $\alpha_0/(\alpha_0 + \beta_0)$, and the sample proportion of successes $Y/n = \sum X_i/n$.
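The weighted-average form in Expression (15.3) can be verified with exact rational arithmetic. This Python sketch is illustrative only: the Beta(2, 4) prior and the (n, y) pairs are hypothetical values, and the function names are not from the text.

```python
from fractions import Fraction


def bayes_estimator(alpha0, beta0, n, y):
    """Posterior mean (alpha0 + y) / (alpha0 + beta0 + n) under a Beta(alpha0, beta0) prior."""
    return Fraction(alpha0 + y, alpha0 + beta0 + n)


def weighted_average(alpha0, beta0, n, y):
    """Right-hand side of (15.3): prior mean and sample proportion, weighted."""
    total = alpha0 + beta0 + n
    w_prior = Fraction(alpha0 + beta0, total)
    w_data = Fraction(n, total)
    return w_prior * Fraction(alpha0, alpha0 + beta0) + w_data * Fraction(y, n)


alpha0, beta0 = 2, 4  # hypothetical prior with mean 1/3
for n, y in [(10, 5), (100, 50), (10000, 5000)]:
    estimate = bayes_estimator(alpha0, beta0, n, y)
    assert estimate == weighted_average(alpha0, beta0, n, y)
    prior_weight = Fraction(alpha0 + beta0, alpha0 + beta0 + n)
    print(n, float(prior_weight), float(estimate))
```

As n grows with the sample proportion held at .5, the weight on the prior mean falls toward 0 and the estimate moves from the prior mean toward .5.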

By the Law of Large Numbers, the sample proportion of successes Y/n converges to the true value
of the Bernoulli parameter, which we will denote by $p^*$. Taking the limit of (15.3) as $n \to \infty$ yields


$$\hat{p} \to 0 \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} + 1 \cdot p^* = p^*,$$


so that the Bayes estimator is indeed consistent.


Bayesian Interval Estimation


The Bayes estimate $\hat{\theta}$ provides a single-value “best guess” for the true value of the parameter $\theta$ based
on its posterior distribution. An interval [a, b] having posterior probability .95 gives a 95% credible
interval, the Bayesian analogue of a 95% confidence interval (but the interpretation is different).
Typically one selects the middle 95% of the posterior distribution; i.e., the endpoints of a 95%
credible interval are ordinarily the .025 and .975 quantiles of the posterior distribution.


Example 15.7 (Example 15.4 continued) Given the observed values of $X_1, \ldots, X_5$, we previously
found that the emission rate parameter $\lambda$ has a Gamma(7, 1/3.85) posterior distribution. A 95%
credible interval for $\lambda$ requires determining the .025 and .975 quantiles of the Gamma(7, 1/3.85)
model. Using statistical software, $\eta_{.025} = 0.7310$ and $\eta_{.975} = 3.3921$, so the 95% credible interval for
$\lambda$ is (0.7310, 3.3921). Under the Bayesian interpretation, having observed the five aforementioned
interemission times, there is a 95% posterior probability that $\lambda$ is between 0.7310 and 3.3921
emissions per second. Taking reciprocals, the mean time between emissions (i.e., $1/\lambda$) is estimated to
lie in the interval (1/3.3921, 1/0.7310) = (0.295, 1.368) seconds with posterior probability .95.
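The text obtains these quantiles from statistical software without naming a package. As one possible way to reproduce them with only the Python standard library, the sketch below uses the fact that a gamma distribution with integer shape has a closed-form (Erlang) cdf, and inverts it by bisection. The function names, the search bracket, and the tolerance are assumptions of this sketch.

```python
import math


def erlang_cdf(x, shape, rate):
    """CDF of a gamma distribution with integer shape and given rate (rate = 1/scale)."""
    if x <= 0:
        return 0.0
    t = rate * x
    term, total = 1.0, 1.0  # running value of t^k / k!, and the partial sum over k < shape
    for k in range(1, shape):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total


def erlang_quantile(p, shape, rate, lo=0.0, hi=100.0, tol=1e-10):
    """Invert the cdf by bisection; assumes the quantile lies between lo and hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Gamma(7, 1/3.85) posterior: shape 7, scale 1/3.85, i.e., rate 3.85.
lower = erlang_quantile(0.025, 7, 3.85)
upper = erlang_quantile(0.975, 7, 3.85)
print(round(lower, 4), round(upper, 4))          # endpoints for lambda
print(round(1 / upper, 3), round(1 / lower, 3))  # endpoints for 1/lambda
```

The endpoints should agree with the values quoted in the example to about four decimal places; a dedicated statistics library would normally be used in place of the hand-rolled bisection.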


Example 15.8 (Example 15.5 continued) The posterior distribution of the parameter $\theta$ = the
proportion of all U.S. adults that incorrectly believe antibiotics kill viruses was a Beta(490, 455)
distribution. The .025 and .975 quantiles of this beta distribution are $\eta_{.025} = .4866$ and $\eta_{.975} = .5503$. So,
after observing the results of the NSF survey, there is a 95% posterior probability that $\theta$ is between
.4866 and .5503.

For comparison, the one-proportion z interval based on y/n = 488/939 = .5197 is


$$.5197 \pm 1.96\sqrt{\frac{.5197(1 - .5197)}{939}} = (.4877, .5517)$$


Due to the large sample size, the two intervals are quite similar.
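Both intervals in this example can be recomputed with the Python standard library. The sketch below is one possible approach, not the method used in the text: because the beta parameters here are integers, the Beta(a, b) cdf equals the binomial tail probability P(Bin(a + b − 1, x) ≥ a), which is summed on the log scale and then inverted by bisection. Function names and tolerances are assumptions.

```python
import math
from statistics import NormalDist


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b: P(Bin(a + b - 1, x) >= a)."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    n = a + b - 1
    log_x, log_1mx = math.log(x), math.log1p(-x)
    total = 0.0
    for j in range(a, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * log_x + (n - j) * log_1mx)
        total += math.exp(log_term)
    return min(total, 1.0)


def beta_quantile(p, a, b, tol=1e-10):
    """Invert beta_cdf by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# 95% credible interval from the Beta(490, 455) posterior.
credible = (beta_quantile(0.025, 490, 455), beta_quantile(0.975, 490, 455))

# One-proportion z interval from y = 488 successes in n = 939 trials.
y, n = 488, 939
p_hat = y / n
half_width = NormalDist().inv_cdf(0.975) * math.sqrt(p_hat * (1 - p_hat) / n)
confidence = (p_hat - half_width, p_hat + half_width)

print([round(v, 4) for v in credible])
print([round(v, 4) for v in confidence])
```

The two printed intervals should be close to (.4866, .5503) and (.4877, .5517), respectively.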


It must be emphasized that, even if the confidence interval is nearly the same as the credible
interval for a parameter, they have different interpretations. To interpret the Bayesian credible
interval, we say that there is a 95% probability that the parameter $\theta$ is in the interval. However, for the
frequentist confidence interval such a probability statement does not make sense: as we discussed in
Section 8.1, neither the parameter $\theta$ nor the endpoints of the interval are considered random under the
frequentist view. (Instead, the confidence level is the long-run capture frequency if the formula is used
repeatedly on different samples.)
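The long-run capture frequency can be illustrated by simulation. The following Python sketch repeatedly draws normal samples with a fixed, known mean and applies the one-sample z interval formula; the values $\mu = 115$, $\sigma = 15$, n = 18, the number of repetitions, and the seed are all hypothetical choices made for this illustration.

```python
import random
from statistics import NormalDist, fmean


def z_interval(xs, sigma, level=0.95):
    """One-sample z interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sigma / len(xs) ** 0.5
    center = fmean(xs)
    return center - half, center + half


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of simulated samples whose 95% z interval captures the fixed mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(xs, sigma)
        hits += lo <= mu <= hi
    return hits / reps


print(coverage(mu=115.0, sigma=15.0, n=18, reps=5000))
```

The printed proportion should land near .95. In this simulation $\mu$ never changes; only the interval endpoints vary from sample to sample, which is the sense in which the confidence level describes the procedure rather than any single interval.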


Example 15.9 Consider the IQ scores of 18 first-grade boys, from the private speech data introduced
in Exercise 81 from Chapter 1:


113 108 140 113 115 146 136 107 108 119 132 127 118 108 103 103 122 111


IQ scores are generally found to be normally distributed, and because IQs have a standard
deviation of 15 nationwide, we can assume $\sigma = 15$ is known and valid here. Let’s perform a Bayesian
analysis on the mean IQ $\mu$ of all first-grade boys at the school.

For the normal prior distribution it is reasonable to use a mean of $\mu_0 = 110$, a ballpark figure for
previous years in this school. It is harder to prescribe a standard deviation for the prior, but we will
use $\sigma_0 = 7.5$. (This is the standard deviation for the average of four independent observations if the
individual standard deviation is 15. As a result, the effect on the posterior mean will turn out to be the
same as if there were four additional observations with average 110.)

The last proposition from Section 15.1 states that the posterior distribution of $\mu$ is also normal and
specifies the posterior hyperparameters. Numerically, we have


$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma^2/n} + \frac{1}{\sigma_0^2} = \frac{1}{15^2/18} + \frac{1}{7.5^2} = .09778 = \frac{1}{10.227} = \frac{1}{3.198^2}$$

$$\mu_1 = \frac{\dfrac{n\bar{x}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} = \frac{\dfrac{18(118.28)}{15^2} + \dfrac{110}{7.5^2}}{\dfrac{18}{15^2} + \dfrac{1}{7.5^2}} = 116.77$$


The posterior distribution is normal with mean $\mu_1 = 116.77$ and standard deviation $\sigma_1 = 3.198$.
The mean $\mu_1$ is a weighted average of $\bar{x} = 118.28$ and $\mu_0 = 110$, so $\mu_1$ is necessarily between them.
As n becomes large the weight given to $\mu_0$ declines, and $\mu_1$ will be closer to $\bar{x}$.
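The posterior hyperparameters can be recomputed directly from the 18 IQ scores. This Python sketch is a minimal illustration of the precision-weighted update (the function name is an assumption, not from the text).

```python
from statistics import fmean

iq = [113, 108, 140, 113, 115, 146, 136, 107, 108,
      119, 132, 127, 118, 108, 103, 103, 122, 111]


def normal_posterior(xs, sigma, mu0, sigma0):
    """Posterior mean and sd of a normal mean with known sigma and a N(mu0, sigma0) prior."""
    data_precision = len(xs) / sigma ** 2   # n / sigma^2
    prior_precision = 1 / sigma0 ** 2       # 1 / sigma0^2
    post_precision = data_precision + prior_precision
    mu1 = (data_precision * fmean(xs) + prior_precision * mu0) / post_precision
    return mu1, post_precision ** -0.5


mu1, sigma1 = normal_posterior(iq, sigma=15, mu0=110, sigma0=7.5)
print(round(fmean(iq), 2))               # 118.28
print(round(mu1, 2), round(sigma1, 3))   # 116.77 3.198
```

Because the two weights are precisions, the posterior mean necessarily lies between the prior mean and the sample mean.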

The 95% credible interval for $\mu$ is the middle 95% of the N(116.77, 3.198) distribution, which
works out to be (110.502, 123.038). For comparison, the 95% confidence interval using $\bar{x} = 118.28$
and $\sigma = 15$ is $\bar{x} \pm 1.96\sigma/\sqrt{n} = (111.35, 125.21)$. Notice that the one-sample z interval must be
wider: because the precisions add to give the posterior precision, the posterior precision is greater than
the prior precision and it is greater than the data precision. Therefore, it is guaranteed that the
posterior standard deviation $\sigma_1$ will be less than both $\sigma_0$ and $\sigma/\sqrt{n}$.
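A short Python sketch of the two interval calculations follows, using the rounded summary values quoted in the example ($\mu_1 = 116.77$, $\sigma_1 = 3.198$, $\bar{x} = 118.28$); the variable names are illustrative.

```python
from statistics import NormalDist

# 95% credible interval: middle 95% of the N(116.77, 3.198) posterior.
posterior = NormalDist(mu=116.77, sigma=3.198)
credible = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

# 95% one-sample z interval from xbar = 118.28, sigma = 15, n = 18.
xbar, sigma, n = 118.28, 15, 18
se = sigma / n ** 0.5
z = NormalDist().inv_cdf(0.975)
confidence = (xbar - z * se, xbar + z * se)

print([round(v, 2) for v in credible])     # [110.5, 123.04]
print([round(v, 2) for v in confidence])   # [111.35, 125.21]

# The posterior sd is smaller than both the prior sd and sigma / sqrt(n).
print(posterior.stdev < 7.5 and posterior.stdev < se)  # True
```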

Both the credible interval and the confidence interval exclude 110, so we can be pretty sure that $\mu$
exceeds 110. Another way of looking at this is to calculate the posterior probability of $\mu$ being less
than or equal to 110 (the Bayesian approach to hypothesis testing). Using $\mu_1 = 116.77$ and
$\sigma_1 = 3.198$, we obtain the probability .0171, supporting the claim that $\mu$ exceeds 110.
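This posterior probability is a single normal cdf evaluation. A minimal Python check, using the rounded posterior parameters from the example:

```python
from statistics import NormalDist

# Posterior of mu from Example 15.9: N(116.77, 3.198).
posterior = NormalDist(mu=116.77, sigma=3.198)

# Posterior probability that mu is at most 110.
prob = posterior.cdf(110)
print(round(prob, 4))  # 0.0171
```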

What should be done if there are no prior observations and there are no strong opinions about the
prior mean $\mu_0$? In this case the prior standard deviation $\sigma_0$ can be taken as some number much larger
than $\sigma$, such as $\sigma_0 = 1000$ in our example. The result is that the prior will have essentially no effect,
and the posterior distribution will be based on the data: $\mu_1 \approx \bar{x} = 118.28$ and
$\sigma_1 \approx \sigma/\sqrt{n} = 15/\sqrt{18} = 3.54$. The 95% credible interval will be virtually the same as the 95%
confidence interval based on the 18 observations, (111.35, 125.21), but of course the interpretation is
different.
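The effect of a very diffuse prior can be checked with the same precision-weighted update. In this Python sketch the function name is illustrative, and the inputs are the summary values from the example together with the prior standard deviation 1000.

```python
def normal_posterior_from_summary(xbar, n, sigma, mu0, sigma0):
    """Posterior mean and sd of a normal mean, given the sample mean and known sigma."""
    data_precision = n / sigma ** 2
    prior_precision = 1 / sigma0 ** 2
    post_precision = data_precision + prior_precision
    mu1 = (data_precision * xbar + prior_precision * mu0) / post_precision
    return mu1, post_precision ** -0.5


mu1, sigma1 = normal_posterior_from_summary(118.28, 18, 15, mu0=110, sigma0=1000)
print(round(mu1, 2), round(sigma1, 2))  # 118.28 3.54

# 95% credible interval under the diffuse prior (1.96 is the .975 normal quantile).
print(round(mu1 - 1.96 * sigma1, 2), round(mu1 + 1.96 * sigma1, 2))  # 111.35 125.21
```

With prior precision $1/1000^2$, the prior contributes almost nothing to either the posterior mean or the posterior precision.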


Exercises: Section 15.2 (11-20)


11. Refer back to Exercise 3.
a. Calculate the Bayes estimate of the Poisson mean parameter $\mu$.
b. Calculate and interpret a 95% credible interval for $\mu$.

12. Refer back to Exercise 4.
a. Calculate the Bayes estimate of the probability p.
b. Calculate and interpret a 95% credible interval for p.

13. Laplace’s rule of succession says that if all n Bernoulli trials have been successes, then the
probability of a success on the next trial is (n + 1)/(n + 2). For the derivation, Laplace used a
Beta(1, 1) prior for the parameter p.
a. Show that, if a Beta(1, 1) prior is assigned to p and there are n successes in n trials, then the
posterior mean of p is (n + 1)/(n + 2).
b. Explain (a) in terms of total successes and failures; that is, explain the result in terms of two
prior trials plus n later trials.
c. Laplace applied his rule of succession to compute the probability that the sun will rise tomorrow
using 5000 years, or n = 1,826,214 days of history in which the sun rose every day. Is Laplace’s
method equivalent to including two prior days when the sun rose once and failed to rise once?
Criticize the answer in terms of total successes and failures.

14. In a study of 70 restaurant bills, 40 of the 70 were paid using cash. Let p denote the population
proportion paying cash.
a. Assuming a beta prior distribution for p with $\alpha_0 = 2$ and $\beta_0 = 2$, obtain the posterior
distribution of p.
b. Repeat (a) with $\alpha_0$ and $\beta_0$ positive and close to 0.
c. Calculate a 95% credible interval for p using (b). Is your interval compatible with p = .5?
d. Calculate a 95% confidence interval for p using Equation (5.3), and compare with the result
of (c).
e. Compare the interpretations of the credible interval and the confidence interval.
f. Based on the prior in (b), test the hypothesis p ≤ .5 by using the posterior distribution to find
P(p ≤ .5).

15. For the scenario of Example 15.9, assume the same normal prior distribution but suppose that the
data set is just one observation $\bar{x} = 118.28$ with standard deviation $\sigma/\sqrt{n} = 15/\sqrt{18} = 3.5355$.
Use Equation (15.1) to derive the posterior distribution, and compare your answer with the result of
Example 15.9.

16. Here are the IQ scores for the 15 first-grade girls from the study mentioned in Example 15.9.

102 96 106 118 108 122 115 113
109 113 82 110 121 110 99

Assume that the data is a random sample from a normal distribution with mean $\mu$ and $\sigma = 15$, and
assign to $\mu$ the same N(110, 7.5) prior distribution used in Example 15.9.
a. Determine the posterior distribution of $\mu$.
b. Calculate and interpret a 95% credible interval for $\mu$.
c. Add four observations with average 110 to the data and compute a 95% confidence one-sample
z interval for $\mu$ using the 19 observations. Compare with the result of (b).
d. Change the prior so the prior precision is very small but positive, and then recompute (a)
and (b).
e. Calculate a 95% confidence one-sample z interval for $\mu$ using the 15 observations and compare
with the credible interval of (d).

17. If $\alpha$ and $\beta$ are large, then the beta distribution can be approximated by the normal distribution
using the beta mean and variance given in Section 4.5. This is useful in case beta distribution
software is unavailable. Use the approximation to compute the credible interval in Example 15.8.

18. Two political scientists wish to forecast the proportion of votes that a certain U.S. senator will
earn in her upcoming reelection contest. The first political scientist assigns a Beta(3, 2) prior to
p = the true support rate for the senator, while the second assigns a Beta(6, 6) prior.
a. Determine the expectations of both prior distributions.
b. Which political scientist appears to feel more sure about this prior belief? How can you tell?
c. Determine both Bayes estimates in this scenario, assuming that y out of n randomly selected
voters indicate they will vote to reelect the senator.
d. For what survey size n are the two Bayes estimates guaranteed to be within .005 of each other,
no matter the value of y?

19. Consider a random sample $X_1, \ldots, X_n$ from a Poisson distribution with unknown mean $\mu$, and
assign to $\mu$ a Gamma($\alpha_0$, $\beta_0$) prior distribution.
a. What is the prior expectation of $\mu$?
b. Determine the Bayes estimator $\hat{\mu}$.
c. Let $\mu^*$ denote the true value of $\mu$. Show that $\hat{\mu}$ is a consistent estimator of $\mu^*$. [Hint: Look
back at Example 15.6.]

20. Consider a random sample $X_1, \ldots, X_n$ from an exponential distribution with parameter $\lambda$, and
assign to $\lambda$ a Gamma($\alpha_0$, $\beta_0$) prior distribution.
a. What is the prior expectation of $\lambda$?
b. Determine the Bayes estimator $\hat{\lambda}$.
c. Let $\lambda^*$ denote the true value of $\lambda$. Show that $\hat{\lambda}$ is a consistent estimator of $\lambda^*$.


Appendix 


Table A.1 Cumulative binomial probabilities

$$B(x; n, p) = \sum_{y=0}^{x} b(y; n, p)$$

a. n = 5
                                                 p
 x   0.01   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.80   0.90   0.95   0.99
 0   .951   .774   .590   .328   .237   .168   .078   .031   .010   .000   .000   .000   .000
 1   .999   .977   .919   .737   .633   .528   .337   .188   .087   .007   .000   .000   .000
 2  1.000   .999   .991   .942   .896   .837   .683   .500   .317   .058   .009   .001   .000
 3  1.000  1.000  1.000   .993   .984   .969   .913   .812   .663   .263   .081   .023   .001
 4  1.000  1.000  1.000  1.000   .999   .998   .990   .969   .922   .672   .410   .226   .049

b. n = 10
                                                 p
 x   0.01   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.80   0.90   0.95   0.99
 0   .904   .599   .349   .107   .056   .028   .006   .001   .000   .000   .000   .000   .000
 1   .996   .914   .736   .376   .244   .149   .046   .011   .002   .000   .000   .000   .000
 2  1.000   .988   .930   .678   .526   .383   .167   .055   .012   .000   .000   .000   .000
 3  1.000   .999   .987   .879   .776   .650   .382   .172   .055   .001   .000   .000   .000
 4  1.000  1.000   .998   .967   .922   .850   .633   .377   .166   .006   .000   .000   .000
 5  1.000  1.000  1.000   .994   .980   .953   .834   .623   .367   .033   .002   .000   .000
 6  1.000  1.000  1.000   .999   .996   .989   .945   .828   .618   .121   .013   .001   .000
 7  1.000  1.000  1.000  1.000  1.000   .998   .988   .945   .833   .322   .070   .012   .000
 8  1.000  1.000  1.000  1.000  1.000  1.000   .998   .989   .954   .624   .264   .086   .004
 9  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .994   .893   .651   .401   .096

c. n = 15
                                                 p
 x   0.01   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.80   0.90   0.95   0.99
 0   .860   .463   .206   .035   .013   .005   .000   .000   .000   .000   .000   .000   .000
 1   .990   .829   .549   .167   .080   .035   .005   .000   .000   .000   .000   .000   .000
 2  1.000   .964   .816   .398   .236   .127   .027   .004   .000   .000   .000   .000   .000
 3  1.000   .995   .944   .648   .461   .297   .091   .018   .002   .000   .000   .000   .000
 4  1.000   .999   .987   .836   .686   .515   .217   .059   .009   .000   .000   .000   .000
 5  1.000  1.000   .998   .939   .852   .722   .403   .151   .034   .000   .000   .000   .000
 6  1.000  1.000  1.000   .982   .943   .869   .610   .304   .095   .001   .000   .000   .000
 7  1.000  1.000  1.000   .996   .983   .950   .787   .500   .213   .004   .000   .000   .000
 8  1.000  1.000  1.000   .999   .996   .985   .905   .696   .390   .018   .000   .000   .000
 9  1.000  1.000  1.000  1.000   .999   .996   .966   .849   .597   .061   .002   .000   .000
10  1.000  1.000  1.000  1.000  1.000   .999   .991   .941   .783   .164   .013   .001   .000
11  1.000  1.000  1.000  1.000  1.000  1.000   .998   .982   .909   .352   .056   .005   .000
12  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .996   .973   .602   .184   .036   .000
13  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .995   .833   .451   .171   .010
14  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .965   .794   .537   .140


Table A.1 (continued)


B(x;n,p) b(y:n,p) 


i 
iM 


y 


0.01 0.05 0.10 0.20 0.25 0.30 0.40 0.50 0.60 0.70 0.75 0.80 0.90 0.95 0.99 
d.n = 20 


x 0 818 358 122 012 003 001 000 000 .000 000 .000 000 .000 §=.000 ~=—.000 
1 983 -736 392 .069 .024 008 O01 000 000 000 .000 000 .000 §=.000 ~=—.000 
2 999 925 .677 .206 091 035 004 000 .000 000 .000 .000 000 ©.000 ~—.000 
3. 1.000 984 867 All .225 107 016 001 .000 000.000 .000 .000 §©.000 = .000 
4 1.000 997 957 .630 AIS 238 O51 006 .000 000.000 .000 .000 §©.000 ~—.000 
5 1.000 1.000 989 .804 617 416 126 021 002 000 .000 000 .000 §=.000 ~=—.000 
6 1.000 — 1.000 998 913 .786 608 -250 058 006 000.000 .000 .000 000 ~~ = .000 
7 1.000 1.000 — 1.000 .968 898 .772 416 132 021 001 .000 .000 000 000 ~§©.000 
8 1.000 1.000 1.000 990 959 887 596 252 O57 005 .001 .000 .000 000 = .000 
9 1.000 1.000 — 1.000 997 .986 952 2155 412 128 017. 004 —-.001_~=— 000) .000~—-.000 

10 ~=1.000 =1.000 1.000 999 .996 983 872 588 .245 048 .014 = .003 =.000 =.000_~—.000 
11 1.000 =1.000 1.000 1.000 999 995 943 -748 404 113) 041. = -.010 = 000 = .000-~—-.000 
12. 1.000 1.000 1.000 1.000 — 1.000 999 979 868 584 228 = =.102) =.032, = 000 )=.000-~—.000 
13. 1.000 =1.000 1.000 1.000 1.000 — 1.000 994 942 .750 392.214 087 =.002 =.000_~—-.000 
14 1.000 1.000 1.000 1.000 1.000 — 1.000 998 979 .874 584 =.383) 196 = 011 = .000-~—-.000 
15 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 994 949 762. 585.370.043.003 ~— «000 
16 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 999 984 893.775.589.133 016.000 
17. 1.000 =1.000 1.000 1.000 1.000 1.000 1.000 1.000 .996 965 909 .794 323 075 001 
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 999 992 976 .931 608 .264 017 
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 999 997 988878642182 
e.n=25 

x 0 778 277 .072 004 001 000 000 .000 .000 000 .000  .000 .000 .000 ~~ .000 
1 974 642 271 027 007 002 000 000 .000 000 .000 000 .000 §©.000 ~=.000 
2 998 873 537 098 032 009 000 000 .000 000 .000 .000 .000 000 ~~ .000 
31.000 966 764 234 096 .033 002 000 .000 000 .000 000 .000 §=.000 ~=—.000 
4 1.000 993 902 A21 214 .090 009 000 000 000 .000 000 .000 §=.000 ~=—-.000 
5 1.000 999 .967 617 378 193 029 002 000 .000 .000 000 .000 §=.000 ~=—.000 
6 1.000 — 1.000 991 .780 561 341 .074 007 000 000.000 .000 .000 §=.000 ~=—.000 
7 1.000 — 1.000 998 891 727 512 154 022 001 000 .000 .000 .000 000 ~~ = .000 
8 1.000 1.000 1.000 953 851 .677 274 054 004 000 .000 .000 .000 §©.000 ~=—.000 
9 1.000 1.000 — 1.000 983 929 811 425 115 013 000 000 .000 = .000 §=.000 ~=—-.000 

10 =1.000 1.000 1.000 994 .970 902 586 212 034 002.000 = .000 §=.000 §=.000-~=—-.000 
11 =1.000 =1.000 1.000 998 980 .956 -732 345 078 006 001 .000 .000 000 = .000 
12. 1.000 1.000 1.000 1.000 997 983 846 500 154 017, 003. 000 »=.000 = .000_~—.000 
13. 1.000 =1.000 1.000 1.000 999 994 922 655 268 044 020 .002 .000 §=.000~=—.000 
14 1.000 1.000 1.000 1.000 — 1.000 998 966 -788 A414 098 .030 .006 .000 000 .000 
15 1.000 1.000 1.000 1.000 1.000 1.000 987 885 575 189 071 =.017.— 000 = .000-~—-.000 
16 1.000 1.000 1.000 1.000 1.000 — 1.000 996 946 .726 323. 149.047, 000 = .000-~—.000 
17. —-1.000 =1.000 =1.000 1.000 1.000 — 1.000 999 978 846 A488 -.273, «109 .002—s- .000-~—.000 
18 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 993 926 659 439.220 = .009 = .000 ~=.000 
19 =1.000 1.000 1.000 1.000 1.000 1.000 1.000 998 971 807.622.) 383. 033s 001 _~—-.000 
20 1.000 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 991 910 = .786 )=—.579 098 = .007_~—— 000 
21 1.000 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 998 967 .904 .766 .236 §=.034 ~—-.000 
22 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 91 968 =—.902—s 463 127-— 002 
23 1.000 =1.000 1.000 1.000 1.000 1.000 1.000 1.000 — 1.000 998 =.993, 973.729.358.026 
24 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 .999 .996 .928 .723  .222


Table A.2 Cumulative Poisson probabilities

$$P(x; \mu) = \sum_{y=0}^{x} \frac{e^{-\mu}\mu^{y}}{y!}$$

                                        μ
 x     .1     .2     .3     .4     .5     .6     .7     .8     .9    1.0
 0   .905   .819   .741   .670   .607   .549   .497   .449   .407   .368
 1   .995   .982   .963   .938   .910   .878   .844   .809   .772   .736
 2  1.000   .999   .996   .992   .986   .977   .966   .953   .937   .920
 3         1.000  1.000   .999   .998   .997   .994   .991   .987   .981
 4                       1.000  1.000  1.000   .999   .999   .998   .996
 5                                            1.000  1.000  1.000   .999
 6                                                                 1.000

                                           μ
 x    2.0    3.0    4.0    5.0    6.0    7.0    8.0    9.0   10.0   15.0   20.0
 0   .135   .050   .018   .007   .002   .001   .000   .000   .000   .000   .000
 1   .406   .199   .092   .040   .017   .007   .003   .001   .000   .000   .000
 2   .677   .423   .238   .125   .062   .030   .014   .006   .003   .000   .000
 3   .857   .647   .433   .265   .151   .082   .042   .021   .010   .000   .000
 4   .947   .815   .629   .440   .285   .173   .100   .055   .029   .001   .000
 5   .983   .916   .785   .616   .446   .301   .191   .116   .067   .003   .000
 6   .995   .966   .889   .762   .606   .450   .313   .207   .130   .008   .000
 7   .999   .988   .949   .867   .744   .599   .453   .324   .220   .018   .001
 8  1.000   .996   .979   .932   .847   .729   .593   .456   .333   .037   .002
 9          .999   .992   .968   .916   .830   .717   .587   .458   .070   .005
10         1.000   .997   .986   .957   .901   .816   .706   .583   .118   .011
11                 .999   .995   .980   .947   .888   .803   .697   .185   .021
12                1.000   .998   .991   .973   .936   .876   .792   .268   .039
13                        .999   .996   .987   .966   .926   .864   .363   .066
14                       1.000   .999   .994   .983   .959   .917   .466   .105
15                               .999   .998   .992   .978   .951   .568   .157
16                              1.000   .999   .996   .989   .973   .664   .221
17                                     1.000   .998   .995   .986   .749   .297
18                                             .999   .998   .993   .819   .381
19                                            1.000   .999   .997   .875   .470
20                                                   1.000   .998   .917   .559
21                                                           .999   .947   .644
22                                                          1.000   .967   .721
23                                                                  .981   .787
24                                                                  .989   .843
25                                                                  .994   .888
26                                                                  .997   .922
27                                                                  .998   .948
28                                                                  .999   .966
29                                                                 1.000   .978
30                                                                         .987
31                                                                         .992
32                                                                         .995
33                                                                         .997
34                                                                         .999
35                                                                         .999
36                                                                        1.000


Table A.3 Standard normal curve areas 


.0003 
.0005 
.0007 
.0010 
.0013 
0019 
.0026 
.0035 
.0047 
.0062 
.0082 
.0107 
.0139 
.0179 
0228 
.0287 
0359 
0446 
0548 
.0668 
.0808 
.0968 
1151 
.1357 
1587 
1841 
2119 
.2420 
.2743 
3085 
3446 
3821 
4207 
4602 
5000 


.0003 
0005 
.0007 
.0009 
.0013 
.0018 
0025 
.0034 
0045 
.0060 
.0080 
.0104 
.0136 
0174 
0222 
0281 
0352 
.0436 
0537 
.0655 
.0793 
0951 
1131 
1335 
1562 
1814 
.2090 
.2389 
.2709 
3050 
3409 
3783 
4168 
4562 
4960 


.0003 
0005 
.0006 
.0009 
.0013 
.0017 
0024 
.0033 
0044 
0059 
.0078 
.0102 
.0132 
.0170 
.0217 
0274 
0344 
0427 
0526 
.0643 
.0778 
0934 
1112 
1314 
1539 
1788 
.2061 
.2358 
.2676 
3015 
3372 
3745 
4129 
4522 
4920 


.0003 
.0004 
.0006 
.0009 
0012 
0017 
.0023 
.0032 
0043 
.0057 
.0075 
.0099 
.0129 
.0166 
0212 
.0268 
.0336 
0418 
0516 
.0630 
.0764 
.0918 
1093 
1292 
1515 
.1762 
.2033 
2327 
.2643 
.2981 
3336 
3707 
4090 
4483 
4880 


.0003 
.0004 
.0006 
.0008 
.0012 
.0016 
.0023 
0031 
0041 
0055 
.0073 
.0096 
0125 
.0162 
.0207 
.0262 
0329 
.0409 
0505 
.0618 
.0749 
.0901 
1075 
1271 
1492 
.1736 
.2005 
.2296 
2611 
.2946 
3300 
3669 
4052 
4443 
4840 


Φ(z) = P(Z ≤ z)


Standard normal density curve 


.0003 
.0004 
.0006 
.0008 
0011 
.0016 
0022 
.0030 
.0040 
.0054 
0071 
0094 
0122 
0158 
.0202 
0256 
0322 
.0401 
0495 
.0606 
0735 
0885 
1056 
1251 
.1469 
A711 
1977 
.2266 
.2578 
2912 
3264 
3632 
4013 
4404 
4801 


.0003 
.0004 
.0006 
.0008 
0011 
.0015 
.0021 
.0029 
.0039 
.0052 
.0069 
0091 
0119 
0154 
.0197 
.0250 
.0314 
.0392 
0485 
0594 
.0722 
.0869 
1038 
1230 
1446 
.1685 
1949 
.2236 
.2546 
2877 
3228 
3594 
3974 
4364 
4761 


.0003 
.0004 
0005 
.0007 
.0010 
0014 
.0020 
.0027 
.0037 
.0049 
.0066 
.0087 
0113 
.0146 
.0188 
0239 
.0301 
0375 
.0465 
0571 
.0694 
.0838 
.1003 
.1190 
1401 
.1635 
1894 
.2177 
2483 
.2810 
3156 
3520 
3897 
4286 
4681
Shaded area = Φ(z)


.09 
.0002 
.0003 
.0005 
.0007 
.0010 
.0014 
.0019 
.0026 
.0036 
.0048 
.0064 
.0084 
.0110 
.0143 
.0183 
.0233 
.0294 
.0367 
.0455 
.0559 
.0681 
.0823 
.0985 
.1170 
.1379 
1611 
.1867 
2148 
2451 
.2716 
3121 
3482 
3859 
4247 
4641 

Table A.3 (continued)


5040 
5438 
5832 
.6217 
6591 
.6950 
7291 
7611 
.7910 
8186 
8438 
.8665 
8869 
.9049 
9207 
9345 
.9463 
9564 
.9649 
.9719 
.9778 
.9826 
.9864 
.9896 
9920 
9940 
9955 
.9966 
9975 
9982 
9987 
9991 
9993 
9995 
9997 


5080 
5478 
5871 
.6255 
.6628 
6985 
7324 
.7642 
.7939 
8212 
8461 
.8686 
8888 
.9066 
9222 
.9357 
9474 
9573 
.9656 
.9726 
.9783 
.9830 
.9868 
.9898 
9922 
9941 
.9956 
.9967 
.9976 
9982 
.9987 
9991 
9994 
9995 
9997 


5120 
517 
5910 
.6293 
.6664 
.7019 
.7357 
.7673 
£7967 
8238 
8485 
8708 
8907 
.9082 
.9236 
.9370 
9484 
9582 
.9664 
.9732 
.9788 
9834 
9871 
9901 
9925 
9943 
9957 
.9968 
9977 
9983 
.9988 
9991 
9994 
.9996 
9997 


5160 
5557 
5948 
.6331 
.6700 
.7054 
7389 
.7704 
7995 
8264 
8508 
8729 
8925 
.9099 
9251 
9382 
9495 
9591 
9671 
.9738 
.9793 
.9838 
9875 
9904 
9927 
9945 
9959 
.9969 
9977 
9984 
.9988 
9992 
9994 
.9996 
.9997 


5199 
5596 
5987 
.6368 
.6736 
7088 
7422 
.7734 
8023 
8289 
8531 
8749 
8944 
9115 
9265 
9394 
9505 
9599 
.9678 
9744 
.9798 
9842 
.9878 
.9906 
9929 
.9946 
.9960 
.9970 
.9978 
9984 
.9989 
.9992 
9994 
.9996 
.9997 


5239 
5636 
.6026 
.6406 
.6772 
7123 
.7454 
.7764 
8051 
8315 
8554 
8770 
8962 
9131 
9278 
.9406 
9515 
.9608 
.9686 
9750 
.9803 
.9846 
9881 
.9909 
9931 
9948 
9961 
9971 
.9979 
9985 
9989 
9992 
9994 
.9996 
9997


Table A.4 The incomplete gamma function

                                       α
 x      1      2      3      4      5      6      7      8      9     10
 1   .632   .264   .080   .019   .004   .001   .000   .000   .000   .000
 2   .865   .594   .323   .143   .053   .017   .005   .001   .000   .000
 3   .950   .801   .577   .353   .185   .084   .034   .012   .004   .001
 4   .982   .908   .762   .567   .371   .215   .111   .051   .021   .008
 5   .993   .960   .875   .735   .560   .384   .238   .133   .068   .032
 6   .998   .983   .938   .849   .715   .554   .394   .256   .153   .084
 7   .999   .993   .970   .918   .827   .699   .550   .401   .271   .170
 8  1.000   .997   .986   .958   .900   .809   .687   .547   .407   .283
 9          .999   .994   .979   .945   .884   .793   .676   .544   .413
10         1.000   .997   .990   .971   .933   .870   .780   .667   .542
11                 .999   .995   .985   .962   .921   .857   .768   .659
12                1.000   .998   .992   .980   .954   .911   .845   .758
13                        .999   .996   .989   .974   .946   .900   .834
14                       1.000   .998   .994   .986   .968   .938   .891


Table A.5 Critical values for chi-squared distributions


<= 


eRe ee 
WNrFR TOANDAUNUBRWNKE 


ll tll ll ell ol ned 
COMIN MNA 


WWWWWWWWWWNNNNNNNYNNN WY 
OCWAANDMAWNHK TOANINIAUNBRWNK OO 


40 


3 

2 2 
For v > 40,72, + v{ l-— — 
or v > 40, 73. ( ZtuyZ) 


995 
0.000 
0.010 
0.072 
0.207 
0.412 
0.676 
0.989 
1.344 
1.735 
2.156 
2.603 
3.074 
3.565 
4.075 
4.600 
5.142 
5.697 
6.265 
6.843 
7.434 
8.033 
8.643 
9.260 
9.886 
10.519 
11.160 
11.807 
12.461 
13.120 
13.787 
14.457 
15.134 
15.814 
16.501 
17.191 
17.887 
18.584 
19.289 
19.994 
20.706 


99 
0.000 
0.020 
0.115 
0.297 
0.554 
0.872 
1.239 
1.646 
2.088 
2.558 
3.053 
3.571 
4.107 
4.660 
5.229 
5.812 
6.407 
7.015 
7.632 
8.260 
8.897 
9.542 

10.195 
10.856 
11.523 
12.198 
12.878 
13.565 
14.256 
14.954 
15.655 
16.362 
17.073 
17.789 
18.508 
19.233 
19.960 
20.691 
21.425 
22.164 


975 
0.001 
0.051 
0.216 
0.484 
0.831 
1.237 
1.690 
2.180 
2.700 
3.247 
3.816 
4.404 
5.009 
5.629 
6.262 
6.908 
7.564 
8.231 
8.906 
9.591 
10.283 
10.982 
11.688 
12.401 
13.120 
13.844 
14.573 
15.308 
16.047 
16.791 
17.538 
18.291 
19.046 
19.806 
20.569 
21.336 
22.105 
22.878 
23.654 
24.433 


0.004 
0.103 
0.352 
0.711 
1.145 
1.635 
2.167 
2.733 
3.325 
3.940 
4.575 
5.226 
5.892 
6.571 
7.261 
7.962 
8.682 
9.390 
10.117 
10.851 
11.591 
12.338 
13.090 
13.848 
14.611 
15.379 
16.151 
16.928 
17.708 
18.493 
19.280 
20.072 
20.866 
21.664 
22.465 
23.269 
24.075 
24.884 
25.695 
26.509 


0.016 
0.211 
0.584 
1.064 
1.610 
2.204 
2.833 
3.490 
4.168 
4.865 
5.578 
6.304 
7.041 
7.790 
8.547 
9.312 
10.085 
10.865 
11.651 
12.443 
13.240 
14.042 
14.848 
15.659 
16.473 
17.292 
18.114 
18.939 
19.768 
20.599 
21.433 
22.271 
23.110 
23.952 
24.796 
25.643 
26.492 
27.343 
28.196 
29.050 


2.706 

4.605 

6.251 

7.7719 

9.236 
10.645 
12.017 
13.362 
14.684 
15.987 
17.275 
18.549 
19.812 
21.064 
22.307 
23.542 
24.769 
25.989 
27.203 
28.412 
29.615 
30.813 
32.007 
33.196 
34.381 
35.563 
36.741 
37.916 
39.087 
40.256 
41.422 
42.585 
43.745 
44.903 
46.059 
47.212 
48.363 
49.513 
50.660 
51.805
[Figure: chi-squared density curve]

.05 .025 .01 .005
3.843 5.025 6.637 7.882 
5.992 7.378 9.210 10.597 
7.815 9.348 11.344 12.837 
9.488 11.143 13.277 14.860 

11.070 12.832 15.085 16.748 
12.592 14.440 16.812 18.548 
14.067 16.012 18.474 20.276 
15.507 17.534 20.090 21.954 
16.919 19.022 21.665 23.587 
18.307 20.483 23.209 25.188 
19.675 21.920 24.724 26.755 
21.026 23.337 26.217 28.300 
22.362 24.735 27.687 29.817 
23.685 26.119 29.141 31.319 
24.996 27.488 30.577 32.799 
26.296 28.845 32.000 34.267 
27.587 30.190 33.408 35.716 
28.869 31.526 34.805 37.156 
30.143 32.852 36.190 38.580 
31.410 34.170 37.566 39.997 
32.670 35.478 38.930 41.399 
33.924 36.781 40.289 42.796 
35.172 38.075 41.637 44.179 
36.415 39.364 42.980 45.558 
37.652 40.646 44.313 46.925 
38.885 41.923 45.642 48.290 
40.113 43.194 46.962 49.642 
41.337 44.461 48.278 50.993 
42.557 45.772 49.586 52.333 
43.773 46.979 50.892 53.672 
44.985 48.231 52.190 55.000 
46.194 49.480 53.486 56.328 
47.400 50.724 54.774 57.646 
48.602 51.966 56.061 58.964 
49.802 53.203 57.340 60.272 
50.998 54.437 58.619 61.581 
52.192 55.667 59.891 62.880 
53.384 56.896 61.162 64.181 
54.572 58.119 62.426 65.473 
55.758 59.342 63.691 66.766 
Table A.6 Critical values for t distributions


ν = 1, 2, ..., 30, 32, 34, 36, 38, 40, 50, 60, 120, ∞ (one row per value, top to bottom in each column below)


.10
3.078 
1.886 
1.638 
1.533 
1.476 
1.440 
1.415 
1.397 
1.383 
1.372 
1.363 
1.356 
1.350 
1.345 
1.341 
1.337 
1.333 
1.330 
1.328 
1.325 
1.323 
1.321 
1.319 
1.318 
1.316 
1.315 
1.314 
1.313 
1.311 
1.310 
1.309 
1.307 
1.306 
1.304 
1.303 
1.299 
1.296 
1.289 
1.282 


.05 
6.314 
2.920 
2.353 
2.132 
2.015 
1.943 
1.895 
1.860 
1.833 
1.812 
1.796 
1.782 
1.771 
1.761 
1.753 
1.746 
1.740 
1.734 
1.729 
1.725 
1.721 
1.717 
1.714 
1.711 
1.708 
1.706 
1.703 
1.701 
1.699 
1.697 
1.694 
1.691 
1.688 
1.686 
1.684 
1.676 
1.671 
1.658 
1.645 


.025


12.706 
4.303 
3.182 
2.776 
2.571 
2.447 
2.365 
2.306 
2.262 
2.228 
2.201 
2.179 
2.160 
2.145 
2.131 
2.120 
2.110 
2.101 
2.093 
2.086 
2.080 
2.074 
2.069 
2.064 
2.060 
2.056 
2.052 
2.048 
2.045 
2.042 
2.037 
2.032 
2.028 
2.024 
2.021 
2.009 
2.000 
1.980 
1.960 


[Figure: $t_\nu$ density curve; shaded area = α]


.005 
63.657 
9.925 
5.841 
4.604 
4.032 
3.707 
3.499 
3.355 
3.250 
3.169 
3.106 
3.055 
3.012 
2.977 
2.947 
2.921 
2.898 
2.878 
2.861 
2.845 
2.831 
2.819 
2.807 
2.797 
2.787 
2.779 
2.771 
2.763 
2.756 
2.750 
2.738 
2.728 
2.719 
2.712 
2.704 
2.678 
2.660 
2.617 
2.576 


.001
318.31 
22.327 
10.215 
7.173 
5.893 
5.208 
4.785 
4.501 
4.297 
4.144 
4.025 
3.930 
3.852 
3.787 
3.733 
3.686 
3.646 
3.610 
3.579 
3.552 
3.527 
3.505 
3.485 
3.467 
3.450 
3.435 
3.421 
3.408 
3.396 
3.385 
3.365 
3.348 
3.333 
3.319 
3.307 
3.262 
3.232 
3.160 
3.090 




.0005 
636.62 

31.599 

12.924 
8.610 
6.869 
5.959 
5.408 
5.041 
4.781 
4.587 
4.437 
4.318 
4.221 
4.140 
4.073 
4.015 
3.965 
3.922 
3.883 
3.850 
3.819 
3.792 
3.767 
3.745 
3.725 
3.707 
3.690 
3.674 
3.659 
3.646 
3.622 
3.601 
3.582 
3.566 
3.551 
3.496 
3.460 
3.373 
3.291 
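The final row of each column above (1.282, 1.645, 1.960, 2.576, 3.090, 3.291) matches the standard normal critical values, which is what a t distribution approaches as ν grows. A minimal check, assuming that last row corresponds to ν = ∞; the names `alphas` and `z_crit` are illustrative only.

```python
from statistics import NormalDist

# Upper-tail areas used as column headings in the t critical value table
alphas = [0.10, 0.05, 0.025, 0.005, 0.001, 0.0005]

# z critical value: the point with area alpha to its right under the z curve
z_crit = {a: round(NormalDist().inv_cdf(1.0 - a), 3) for a in alphas}

for a in alphas:
    print(a, z_crit[a])
```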
Table A.7 t curve tail areas


t ν=1 ν=2 ν=3 ν=4
0.0 .500 .500 .500 .500
0.1 .468 .465 .463 .463
0.2 .437 .430 .427 .426
0.3 .407 .396 .392 .390
0.4 .379 .364 .358 .355
0.5 .352 .333 .326 .322
0.6 .328 .305 .295 .290
0.7 .306 .278 .267 .261
0.8 .285 .254 .241 .234
0.9 .267 .232 .217 .210
1.0 .250 .211 .196 .187
1.1 .235 .193 .176 .167
1.2 .221 .177 .158 .148
1.3 .209 .162 .142 .132
1.4 .197 .148 .128 .117
1.5 .187 .136 .115 .104
1.6 .178 .125 .104 .092
1.7 .169 .116 .094 .082
1.8 .161 .107 .085 .073
1.9 .154 .099 .077 .065
2.0 .148 .092 .070 .058
2.1 .141 .085 .063 .052
2.2 .136 .079 .058 .046
2.3 .131 .074 .052 .041
2.4 .126 .069 .048 .037
2.5 .121 .065 .044 .033
2.6 .117 .061 .040 .030
2.7 .113 .057 .037 .027
2.8 .109 .054 .034 .024
2.9 .106 .051 .031 .022
3.0 .102 .048 .029 .020
3.1 .099 .045 .027 .018
3.2 .096 .043 .025 .016
3.3 .094 .040 .023 .015
3.4 .091 .038 .021 .014
3.5 .089 .036 .020 .012
3.6 .086 .035 .018 .011
3.7 .084 .033 .017 .010
3.8 .082 .031 .016 .010
3.9 .080 .030 .015 .009
4.0 .078 .029 .014 .008


v 


500 
461 
423 


.348 
313 
.280 
.249 
.220 
194 
169 
147 
.128 
110 
.095 
081 
.069 
059 
050 
042 
.035 
.030 
.025 
021 
018 
015 
012 
010 
.009 
.007 
.006 
.005 
.004 
.004 
.003 
.002 
.002 
.002 
001 
.001 
.001 


[Figure: $t_\nu$ density curve; the tabulated value is the area to the right of t]

t 13 14 15 16 17 18
0.0 .500 .500 .500 .500 .500 .500
0.1 .461 .461 .461 .461 .461 .461
0.2 .422 .422 .422 .422 .422 .422
0.3 .384 .384 .384 .384 .384 .384
0.4 .348 .347 .347 .347 .347 .347
0.5 .313 .312 .312 .312 .312 .312
0.6 .279 .279 .279 .278 .278 .278
0.7 .248 .247 .247 .247 .247 .246
0.8 .219 .218 .218 .218 .217 .217
0.9 .192 .191 .191 .191 .190 .190
1.0 .168 .167 .167 .166 .166 .165
1.1 .146 .144 .144 .144 .143 .143
1.2 .126 .124 .124 .124 .123 .123
1.3 .108 .107 .107 .106 .105 .105
1.4 .092 .091 .091 .090 .090 .089
1.5 .079 .077 .077 .077 .076 .075
1.6 .067 .065 .065 .065 .064 .064
1.7 .056 .055 .055 .054 .054 .053
1.8 .048 .046 .046 .045 .045 .044
1.9 .040 .038 .038 .038 .037 .037
2.0 .033 .032 .032 .031 .031 .030
2.1 .028 .027 .027 .026 .025 .025
2.2 .023 .022 .022 .021 .021 .021
2.3 .019 .018 .018 .018 .017 .017
2.4 .016 .015 .015 .014 .014 .014
2.5 .013 .012 .012 .012 .011 .011
2.6 .011 .010 .010 .010 .009 .009
2.7 .009 .008 .008 .008 .008 .007
2.8 .008 .007 .007 .006 .006 .006
2.9 .006 .005 .005 .005 .005 .005
3.0 .005 .004 .004 .004 .004 .004
3.1 .004 .004 .004 .003 .003 .003
3.2 .003 .003 .003 .003 .003 .002
3.3 .003 .002 .002 .002 .002 .002
3.4 .002 .002 .002 .002 .002 .002
3.5 .002 .002 .002 .001 .001 .001
3.6 .002 .001 .001 .001 .001 .001
3.7 .001 .001 .001 .001 .001 .001
3.8 .001 .001 .001 .001 .001 .001
3.9 .001 .001 .001 .001 .001 .001
4.0 .001 .001 .001 .001 .000 .000


(continued) 


Table A.7 (continued) 


t 19 20 21 22 23 24 25 26 27 28 29 30 35 40 60 120 ∞ (= z)
0.0 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500 .500
0.1 .461 .461 .461 .461 .461 .461 .461 .461 .461 .461 .461 .461 .460 .460 .460 .460 .460
0.2 .422 .422 .422 .422 .422 .422 .422 .422 .421 .421 .421 .421 .421 .421 .421 .421 .421
0.3 .384 .384 .384 .383 .383 .383 .383 .383 .383 .383 .383 .383 .383 .383 .383 .382 .382
0.4 .347 .347 .347 .347 .346 .346 .346 .346 .346 .346 .346 .346 .346 .346 .345 .345 .345
0.5 .311 .311 .311 .311 .311 .311 .311 .311 .311 .310 .310 .310 .310 .310 .309 .309 .309
0.6 .278 .278 .278 .277 .277 .277 .277 .277 .277 .277 .277 .277 .276 .276 .275 .275 .274
0.7 .246 .246 .246 .246 .245 .245 .245 .245 .245 .245 .245 .245 .244 .244 .243 .243 .242
0.8 .217 .217 .216 .216 .216 .216 .216 .215 .215 .215 .215 .215 .215 .214 .213 .213 .212
0.9 .190 .189 .189 .189 .189 .189 .188 .188 .188 .188 .188 .188 .187 .187 .186 .185 .184
1.0 .165 .165 .164 .164 .164 .164 .163 .163 .163 .163 .163 .163 .162 .162 .161 .160 .159
1.1 .143 .142 .142 .142 .141 .141 .141 .141 .141 .140 .140 .140 .139 .139 .138 .137 .136
1.2 .122 .122 .122 .121 .121 .121 .121 .120 .120 .120 .120 .120 .119 .119 .117 .116 .115
1.3 .105 .104 .104 .104 .103 .103 .103 .103 .102 .102 .102 .102 .101 .101 .099 .098 .097
1.4 .089 .089 .088 .088 .087 .087 .087 .087 .086 .086 .086 .086 .085 .085 .083 .082 .081
1.5 .075 .075 .074 .074 .074 .073 .073 .073 .073 .072 .072 .072 .071 .071 .069 .068 .067
1.6 .063 .063 .062 .062 .062 .061 .061 .061 .061 .060 .060 .060 .059 .059 .057 .056 .055
1.7 .053 .052 .052 .052 .051 .051 .051 .051 .050 .050 .050 .050 .049 .048 .047 .046 .045
1.8 .044 .043 .043 .043 .042 .042 .042 .042 .042 .041 .041 .041 .040 .040 .038 .037 .036
1.9 .036 .036 .036 .035 .035 .035 .035 .034 .034 .034 .034 .034 .033 .032 .031 .030 .029
2.0 .030 .030 .029 .029 .029 .028 .028 .028 .028 .028 .027 .027 .027 .026 .025 .024 .023
2.1 .025 .024 .024 .024 .023 .023 .023 .023 .023 .022 .022 .022 .022 .021 .020 .019 .018
2.2 .020 .020 .020 .019 .019 .019 .019 .018 .018 .018 .018 .018 .017 .017 .016 .015 .014
2.3 .016 .016 .016 .016 .015 .015 .015 .015 .015 .015 .014 .014 .014 .013 .012 .012 .011
2.4 .013 .013 .013 .013 .012 .012 .012 .012 .012 .012 .012 .011 .011 .011 .010 .009 .008
2.5 .011 .011 .010 .010 .010 .010 .010 .010 .009 .009 .009 .009 .009 .008 .008 .007 .006
2.6 .009 .009 .008 .008 .008 .008 .008 .008 .007 .007 .007 .007 .007 .007 .006 .005 .005
2.7 .007 .007 .007 .007 .006 .006 .006 .006 .006 .006 .006 .006 .005 .005 .004 .004 .003
2.8 .006 .006 .005 .005 .005 .005 .005 .005 .005 .005 .005 .004 .004 .004 .003 .003 .003
2.9 .005 .004 .004 .004 .004 .004 .004 .004 .004 .004 .004 .003 .003 .003 .003 .002 .002
3.0 .004 .004 .003 .003 .003 .003 .003 .003 .003 .003 .003 .003 .002 .002 .002 .002 .001
3.1 .003 .003 .003 .003 .003 .002 .002 .002 .002 .002 .002 .002 .002 .002 .001 .001 .001
3.2 .002 .002 .002 .002 .002 .002 .002 .002 .002 .002 .002 .002 .001 .001 .001 .001 .001
3.3 .002 .002 .002 .002 .002 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .000
3.4 .002 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .000 .000
3.5 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .000 .000 .000
3.6 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .001 .000 .000 .000 .000 .000
3.7 .001 .001 .001 .001 .001 .001 .001 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000
3.8 .001 .001 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
3.9 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
4.0 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
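Entries of the t curve tail-area table can be reproduced by integrating the t density numerically. The sketch below uses composite Simpson's rule and only the Python standard library; the function names are hypothetical, and the step count of 2000 is an arbitrary choice that is more than enough for three-decimal accuracy on the tabled range.

```python
import math


def t_density(x: float, nu: float) -> float:
    """Density of the t distribution with nu degrees of freedom."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_upper_tail(t: float, nu: float, steps: int = 2000) -> float:
    """Area to the right of t (t >= 0): 0.5 minus the integral from 0 to t."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, nu) + t_density(t, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, nu)
    return 0.5 - total * h / 3


print(round(t_upper_tail(1.0, 1), 3))   # table row t = 1.0, nu = 1
print(round(t_upper_tail(2.0, 4), 3))   # table row t = 2.0, nu = 4
```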
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Table A.8 Critical values for F distributions 


ν₂ = denominator df


10 


11 


12 


1 
39.86 
161.45 
4052.2 
405284 


8.53 
18.51 
98.50 

998.50 


5.54 
10.13 
34.12 

167.03 


4.54 
7.71 
21.20 
74.14 


4.06 
6.61 
16.26 
47.18 


3.78 
5.99 
13.75 
35.51


3.59 
5.59 
12.25 
29.25 


3.46 
5.32 
11.26 
25.41 


3.36 
5.12 
10.56 
22.86 


3.29 
4.96 
10.04 
21.04 


3.23 
4.84 
9.65 
19.69 


3.18 
4.75 
9.33 
18.64 


2 


49.50 
199.50 
4999.5 
500000 


9.00 
19.00 
99.00 

999.00 


5.46 
9.55 
30.82 
148.50 


4.32 
6.94 
18.00 
61.25 


3.78 
5.79 
13.27 
37.12 


3.46 
5.14 
10.92 
27.00 


3.26 
4.74 
9.55 
21.69 


3.11 
4.46 
8.65 
18.49 


3.01 
4.26 
8.02 
16.39 


2.92 
4.10 
7.56 
14.91 


2.86 
3.98 
7.21 
13.81 


2.81 
3.89 
6.93 
12.97 


ν₁ = numerator df


3
53.59
215.71 
5403.4 
540379 


9.16 
19.16 
99.17 

999.17 


5.39 
9.28 
29.46 
141.11 


4.19 
6.59 
16.69 
56.18 


3.62 
5.41 
12.06 
33.20 


3.29 
4.76 
9.78 
23.70 


3.07 
4.35 
8.45 
18.77 


2.92 
4.07 
7.59 
15.83 


2.81 
3.86 
6.99 
13.90 


2.73 
3.71 
6.55 
12.55 


2.66 
3.59 
6.22 
11.56 


2.61 
3.49 
5.95 
10.80 


55.83 
224.58 
5624.6 


562500 


9.24 
19.25 
99.25 

999.25 


5.34 
9.12 
28.71 
137.10 


4.11 
6.39 
15.98 
53.44 


3.52
5.19
11.39
31.09


5 
57.24 
230.16 
5763.6 
576405 


9.29 
19.30 
99.30 

999.30 


5.31
9.01 
28.24 
134.58 


4.05 
6.26 
15.52 
51.71 


3.45 
5.05 
10.97 
29.75 


3.11 
4.39 
8.75 
20.80 


2.88 
3.97
7.46 
16.21 


2.73 
3.69 
6.63 
13.48 


2.61 
3.48 
6.06 
11.71 


2.52 
3.33 
5.64 
10.48 


2.45 
3.20 
5.32 
9.58 


2.39 
3.11 
5.06 
8.89 


6 


58.20 
233.99 
5859.0 
585937 


9.33
19.33 
99.33 

999.33 


5.28 
8.94 
27.91 
132.85 


4.01 
6.16 
15.21 
50.53 


3.40 
4.95 
10.67 
28.83 


3.05 
4.28 
8.47 
20.03 


2.83 
3.87 
7.19 
15.52 


2.67 
3.58 
6.37 
12.86 


2.55
3.37
5.80 
11.13 


2.46 
3.22
5.39 
9.93 


2.39 
3.09 
5.07 
9.05 


2.33
3.00 
4.82 
8.38 


7 
58.91 
236.77 
5928.4 
592873 


9.35
19.35 
99.36 

999.36 


5.27
8.89 
27.67 
131.58 


3.98 
6.09 
14.98 
49.66 


3.37
4.88 
10.46 
28.16 


3.01 
4.21 
8.26 
19.46 


2.78 
3.79 
6.99 
15.02 


2.62 
3.50 
6.18 
12.40 


2.51 
3.29 
5.61 
10.70 


2.41 
3.14 
5.20 
9.52


2.34 
3.01 
4.89 
8.66 


2.28 
2.91 
4.64 
8.00 


8 
59.44 
238.88 
5981.1 
598144 


9.37 
19.37 
99.37 

999.37 


5.25
8.85 
27.49 
130.62 


3.95
6.04 
14.80 
49.00 


3.34 
4.82 
10.29 
27.65 


2.98 
4.15 
8.10 
19.03 


2.75 
3.73 
6.84 
14.63 


2.59
3.44 
6.03 
12.05 


2.47 
3.23
5.47 
10.37 


2.38 
3.07 
5.06 
9.20 


2.30 
2.95 
4.74 
8.35 


2.24 
2.85 
4.50 
7.71 




9 
59.86 
240.54 
6022.5 
602284 


9.38 
19.38 
99.39 

999.39 


5.24 
8.81 
27.35 
129.86 


3.94 
6.00 
14.66 
48.47 


3.32
4.77 
10.16 
27.24 


2.96 
4.10 
7.98 
18.69 


2.72
3.68 
6.72 
14.33 


2.56 
3.39 
5.91 
11.77 


2.44 
3.18 
5.35 
10.11 


2.35 
3.02 
4.94 
8.96 


2.27
2.90 
4.63 
8.12 


2.21 
2.80 
4.39 
7.48
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Table A.8 (continued) 


10 12 15 

60.19 60.71 61.22 

241.88 243.91 245.95 
6055.8 6106.3 6157.3 


605621 610668 615764 


9.39 9.41 9.42 
19.40 19.41 19.43 
99.40 99.42 99.43 

999.40 999.42 999.43 

5.23 5.22 5.20 

8.79 8.74 8.70 
27.23 27.05 26.87 

129.25 128.32 127.37 

3.92 3.90 3.87 

5.96 5.91 5.86 
14.55 14.37 14.20 
48.05 47.41 46.76 

3.30 3.27 3.24 

4.74 4.68 4.62 
10.05 9.89 9.72 
26.92 26.42 25.91 

2.94 2.90 2.87 

4.06 4.00 3.94 

7.87 7.72 7.56
18.41 17.99 17.56 

2.70 2.67 2.63 

3.64 3.57 3.51

6.62 6.47 6.31 
14.08 13.71 13.32 

2.54 2.50 2.46 

3.35 3.28 3.22

5.81 5.67 5.52 
11.54 11.19 10.84 

2.42 2.38 2.34 

3.14 3.07 3.01 

5.26 5.11 4.96 

9.89 9.57 9.24 

2.32 2.28 2.24 

2.98 2.91 2.85 

4.85 4.71 4.56 

8.75 8.45 8.13 

2.25 2.21 2.17 

2.85 2.79 2.72 

4.54 4.40 4.25 

7.92 7.63 7.32 

2.19 2.15 2.10 

2.75 2.69 2.62

4.30 4.16 4.01 

7.29 7.00 6.71 


20 


61.74 
248.01 
6208.7 
620908 


9.44 
19.45 
99.45 

999.45 


5.18 
8.66 
26.69 
126.42 


3.84 
5.80 
14.02 
46.10 


3.21 
4.56 
9.55 
25.39 


2.84 
3.87 
7.40 
17.12 


2.59 
3.44 
6.16 
12.93 


2.42 
3.15
5.36 
10.48 


2.30 
2.94 
4.81 
8.90 


2.20 
2.77
4.41 
7.80 


2.12 
2.65 
4.10 
7.01 


2.06 
2.54 
3.86 
6.40 


25 


62.05 
249.26 
6239.8 
624017 


9.45 
19.46 
99.46 

999.46 


5.17 
8.63 
26.58 
125.84 


3.83 
5.77 
13.91 
45.70 


3.19 
4.52 
9.45 
25.08 


2.81 
3.83 
7.30 
16.85 


2.57 
3.40 
6.06 
12.69 


2.40 
3.11 
5.26 
10.26 


2.27 
2.89 
4.71 
8.69 


2.17 
2.73 
4.31 
7.60 


2.10 
2.60 
4.01 
6.81 


2.03 
2.50 
3.76 
6.22 


30 


62.26 
250.10 
6260.6 
626099 


9.46 
19.46 
99.47 

999.47 


5.17 
8.62 
26.50 
125.45 


3.82 
5.75
13.84 
45.43 


3.17 
4.50 
9.38 
24.87 


2.80 
3.81 
7.23 
16.67 


2.56 
3.38 
5.99 
12.53 


2.38 
3.08 
5.20 
10.11 


2.25 
2.86 
4.65 
8.55 


2.16 
2.70 
4.25 
7.47


2.08 
2.57 
3.94 
6.68 


2.01 
2.47 
3.70 
6.09 


40 


62.53 
251.14 
6286.8 
628712 


9.47 
19.47 
99.47 

999.47 


5.16 
8.59 
26.41 
124.96 


3.80 
5.72 
13.75 
45.09 


3.16 
4.46 
9.29 
24.60 


2.78 
3.77 
7.14 
16.44 


2.54 
3.34 
5.91 
12.33 


2.36 
3.04 
5.12 
9.92 


2.23 
2.83 
4.57 
8.37 


2.13 
2.66 
4.17 
7.30 


2.05 
2.53
3.86 
6.52 


1.99 
2.43 
3.62 
5.93 


50 


62.69 
251.77 
6302.5 
630285 


9.47 
19.48 
99.48 

999.48 


5.15
8.58 
26.35 
124.66 


3.80 
5.70 
13.69 
44.88


3.15 
4.44 
9.24 
24.44 


2.77 
3.75
7.09 
16.31 


2.52 
3.32 
5.86 
12.20 


2.35 
3.02 
5.07 
9.80 


2.22 
2.80 
4.52 
8.26 


2.12 
2.64 
4.12 
7.19


2.04 
2.51 
3.81 
6.42 


1.97 
2.40 
3.57 
5.83 


60 


62.79 
252.20 
6313.0 
631337 


9.47 
19.48 
99.48 

999.48 


5.15 
8.57 
26.32 
124.47 


3.79 
5.69 
13.65 
44.75 


3.14 
4.43 
9.20 
24.33 


2.76 
3.74 
7.06 
16.21 


2.51 
3.30 
5.82 
12.12 


2.34 
3.01 
5.03 
9.73 


2.21 
2.79 
4.48 
8.19 


2.11 
2.62 
4.08 
7.12


2.03 
2.49 
3.78 
6.35 


1.96 
2.38 
3.54 
5.76 


120 


63.06 
253.25 
6339.4 
633972 


9.48 
19.49 
99.49 

999.49 


5.14 
8.55 
26.22 
123.97 


3.78 
5.66 
13.56 
44.40 


3.12 
4.40 
9.11 
24.06 


2.74 
3.70 
6.97 
15.98 


2.49 
3.27 
5.74 
11.91 


2.32 
2.97 
4.95 
9.53 


2.18 
2.75
4.40 
8.00 


2.08 
2.58 
4.00 
6.94 


2.00 
2.45 
3.69 
6.18 


1.93 
2.34 
3.45 
5.59 




1000 


63.30 
254.19 
6362.7 
636301 


9.49 
19.49 
99.50 

999.50 


5.13
8.53 
26.14 
123.53 


3.76 
5.63 
13.47 
44.09


3.11
4.37 
9.03 
23.82 


2.72 
3.67 
6.89 
15.77 


2.47 
3.23 
5.66 
11.72 


2.30 
2.93 
4.87 
9.36 


2.16 
2.71 
4.32 
7.84 


2.06 
2.54 
3.92 
6.78 


1.98 
2.41 
3.61 
6.02 


1.91 
2.30 
3.37 
5.44 
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Table A.8 (continued) 


ν₂ = denominator df


13 


14 


15 


16 


17 


18 


19 


20 


21 


22 


23 


24 


ν₁ = numerator df


3 


2.56 
3.41 
5.74 

10.21 


2.52 
3.34 
5.56 
9.73 


2.49 
3.29 
5.42 
9.34 


2.46 
3.24 
5.29 
9.01 


2.44 
3.20 
5.19 
8.73 


2.42 
3.16 
5.09 
8.49 


2.40 
3.13 
5.01 
8.28 


2.38 
3.10 
4.94 
8.10 


2.36 
3.07 
4.87 
7.94 


2.35 
3.05 
4.82 
7.80 


2.34 
3.03 
4.76 
7.67 


2.33 
3.01 
4.72 
7.55 


4 


2.43 
3.18 
5.21 
9.07 


2.39 
3.11 
5.04 
8.62 


2.36 
3.06 
4.89 
8.25 


2.33 
3.01 
4.77 
7.94 


2.31 
2.96 
4.67 
7.68 


2.29 
2.93 
4.58 
7.46 


2.27 
2.90 
4.50 
7.27 


2.25 
2.87 
4.43 
7.10 


2.23 
2.84 
4.37 
6.95 


2.22 
2.82 
4.31 
6.81 


2.21 
2.80 
4.26 
6.70 


2.19 
2.78 
4.22 
6.59 


5 


2.35 
3.03 
4.86 
8.35 


2.31 
2.96 
4.69 
7.92 


2.27 
2.90 
4.56 
7.57 


2.24 
2.85 
4.44 
7.27 


2.22 
2.81 
4.34 
7.02 


2.20 
2.77 
4.25 
6.81 


2.18 
2.74 
4.17 
6.62 


2.16 
2.71 
4.10 
6.46 


2.14 
2.68 
4.04 
6.32 


2.13 
2.66 
3.99 
6.19 


2.11
2.64 
3.94 
6.08 


2.10 
2.62 
3.90 
5.98 


6 
2.28 
2.92 
4.62 
7.86 


2.24 
2.85 
4.46 
7.44 


2.21 
2.79
4.32 
7.09 


2.18 
2.74 
4.20 
6.80 


2.15 
2.70 
4.10 
6.56 


2.13 
2.66 
4.01 
6.35 


2.11 
2.63 
3.94 
6.18 


2.09 
2.60 
3.87 
6.02 


2.08 
2.57 
3.81 
5.88 


2.06 
2.55 
3.76 
5.76 


2.05 
2.53 
3.71 
5.65 


2.04 
2.51 
3.67 
5.55 


7 


2.23 
2.83 
4.44 
7.49 


2.19 
2.76 
4.28 
7.08 


2.16 
2.71 
4.14 
6.74 


2.13 
2.66 
4.03 
6.46 


2.10 
2.61 
3.93 
6.22 


2.08 
2.58 
3.84 
6.02 


2.06 
2.54 
3.77 
5.85 


2.04 
2.51 
3.70 
5.69 


2.02 
2.49 
3.64 
5.56 


2.01 
2.46 
3.59 
5.44 


1.99 
2.44 
3.54 
5.33 


1.98 
2.42 
3.50 
5.23 


8 


2.20 
2.77 
4.30 
7.21 


2.15 
2.70 
4.14 
6.80 


2.12 
2.64 
4.00 
6.47 


2.09 
2.59 
3.89 
6.19 


2.06 
2.55 
3.79 
5.96 


2.04 
2.51 
3.71
5.76 


2.02 
2.48 
3.63 
5.59 


2.00 
2.45 
3.56 
5.44 


1.98 
2.42 
3.51
5.31 


1.97 
2.40 
3.45 
5.19 


1.95 
2.37 
3.41 
5.09 


1.94 
2.36 
3.36 
4.99 


2 


9 
2.16 
2.71 
4.19 
6.98 


2.12 
2.65 
4.03 
6.58 


2.09 
2.59 
3.89 
6.26 


2.06 
2.54 
3.78 
5.98 


2.03 
2.49 
3.68 
5.75


2.00 
2.46 
3.60 
5.56 


1.98 
2.42 
3.52
5.39 


1.96 
2.39 
3.46 
5.24 


1.95 
2.37 
3.40 
5.11 


1.93 
2.34 
3.35
4.99 


1.92 
2.32 
3.30 
4.89 


1.91 
2.30 
3.26 
4.80 
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Table A.8 (continued) 


2.14 
2.67 
4.10 
6.80 


2.10 
2.60 
3.94 
6.40 


2.06 
2.54 
3.80 
6.08 


2.03 
2.49 
3.69 
5.81 


2.00 
2.45 
3.59 
5.58 


1.98 
2.41 
3.51 
5.39 


1.96 
2.38 
3.43 
5.22 


1.94 
2.35 
3.37 
5.08 


1.92 
2.32 
3.31 
4.95 


1.90 
2.30 
3.26 
4.83 


1.89 
2.27
3.21 
4.73 


1.88 
2.25 
3.17 
4.64 


2.10 
2.60 
3.96 
6.52 


2.05 
2.53 
3.80 
6.13 


2.02 
2.48 
3.67 
5.81 


1.99 
2.42 
3.55 
5.55


1.96 
2.38 
3.46 
5.32 


1.93 
2.34 
3.37 
5.13 


1.91 
2.31 
3.30 
4.97 


1.89 
2.28 
3.23 
4.82 


1.87 
2.25 
3.17 
4.70 


1.86 
2.23 
3.12 
4.58 


1.84 
2.20 
3.07 
4.48 


1.83 
2.18 
3.03 
4.39 


2.05 
2.53 
3.82 
6.23 


2.01 
2.46 
3.66 
5.85 


1.97 
2.40 
3.52 
5.54 


1.94 
2.35
3.41 
5.27 


1.91 
2.31
3.31 
5.05 


1.89 
2.27 
3.23 
4.87 


1.86 
2.23 
3.15 
4.70 


1.84 
2.20 
3.09 
4.56 


1.83 
2.18 
3.03 
4.44 


1.81 
2.15
2.98 
4.33 


1.80 
2.13 
2.93 
4.23 


1.78 
2.11 
2.89 
4.14 


ν₁ = numerator df
30 


1.96 
2.38 
3.51
5.63 


1.91 
2.31 
3.35 
5.25 


1.87 
2.25 
3.21 
4.95 


1.84 
2.19 
3.10 
4.70 


1.81 
2.15 
3.00 
4.48 


1.78 
2.11 
2.92 
4.30 


1.76 
2.07 
2.84 
4.14 


1.74 
2.04 
2.78 
4.00 


1.72 
2.01 
2.72 
3.88 


1.70 
1.98 
2.67 
3.78 


1.69 
1.96 
2.62 
3.68 


1.67 
1.94 
2.58 
3.59 


40 


1.93 
2.34 
3.43 
5.47 


1.89 
2.27 
3.27 
5.10 


1.85 
2.20 
3.13 
4.80 


1.81 
2.15 
3.02 
4.54 


1.78 
2.10 
2.92 
4.33 


1.75 
2.06 
2.84 
4.15 


1.73 
2.03 
2.76 
3.99 


1.71 
1.99 
2.69 
3.86 


1.69 
1.96 
2.64 
3.74 


1.67 
1.94 
2.58 
3.63 


1.66 
1.91 
2.54 
3.53


1.64 
1.89 
2.49 
3.45 


50 


1.92 
2.31 
3.38 
5.37


1.87 
2.24 
3.22 
5.00 


1.83 
2.18 
3.08 
4.70 


1.79 
2.12 
2.97 
4.45 


1.76 
2.08 
2.87 
4.24 


1.74 
2.04 
2.78 
4.06 


1.71
2.00 
2.71 
3.90 


1.69 
1.97 
2.64 
3.77


1.67 
1.94 
2.58 
3.64 


1.65 
1.91 
2.53 
3.54 


1.64 
1.88 
2.48 
3.44 


1.62 
1.86 
2.44 
3.36 


60 


1.90 
2.30 
3.34 
5.30 


1.86 
2.22 
3.18 
4.94 


1.82 
2.16 
3.05 
4.64 


1.78 
2.11 
2.93 
4.39 


1.75 
2.06 
2.83 
4.18 


1.72 
2.02 
2.75 
4.00 


1.70 
1.98 
2.67 
3.84 


1.68 
1.95 
2.61 
3.70 


1.66 
1.92 
2.55 
3.58 


1.64 
1.89 
2.50 
3.48 


1.62 
1.86 
2.45 
3.38 


1.61 
1.84 
2.40 
3.29 


120 


1.88 
2.25 
3.25 
5.14 


1.83 
2.18 
3.09 
4.77 


1.79 
2.11 
2.96 
4.47 


1.75 
2.06 
2.84 
4.23 


1.72 
2.01 
2.75 
4.02 


1.69
1.97
2.66
3.84


1.67
1.93
2.58
3.68


1.64
1.90
2.52
3.54


1.62
1.87
2.46
3.42


1.60
1.84
2.40
3.32


1.59
1.81
2.35
3.22


1.57
1.79
2.31
3.14


1000 


1.85 
2.21 
3.18 
4.99 


1.80
2.14
3.02
4.62


1.76
2.07
2.88
4.33


1.72
2.02
2.76
4.08


1.69
1.97
2.66
3.87


1.66
1.92
2.58
3.69


1.64
1.88
2.50
3.53


1.61
1.85
2.43
3.40


1.59
1.82
2.37
3.28


1.57
1.79
2.32
3.17


1.55
1.76
2.27
3.08


1.54
1.74
2.22 
2.99 
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Table A.8 (continued) 


ν₂ = denominator df


25 


26 


27 


28 


29 


30 


40 


50 


60 


100 


200 


1000 


2.53 
3.39 
5.57 
9.22 


2.52 
3.37
5.53
9.12 


2.51 
3.35 
5.49 
9.02 


2.50 
3.34 
5.45 
8.93 


2.50 
3.33 
5.42 
8.85 


2.49 
3.32 
5.39 
8.77 


2.44 
3.23 
5.18 
8.25 


2.41 
3.18 
5.06 
7.96 


2.39 
3.15 
4.98 
7.77 


2.36 
3.09 
4.82 
7.41


2.33
3.04 
4.71 
7.15


2.31 
3.00 
4.63 
6.96 


ν₁ = numerator df


3 


2.32 
2.99 
4.68 
7.45


2.31 
2.98 
4.64 
7.36 


2.30 
2.96 
4.60 
7.27 


2.29 
2.95 
4.57 
7.19


2.28 
2.93 
4.54 
7.12 


2.28 
2.92 
4.51 
7.05 


2.23 
2.84 
4.31 
6.59 


2.20 
2.79 
4.20 
6.34 


2.18 
2.76 
4.13 
6.17 


2.14 
2.70 
3.98 
5.86 


2.11 
2.65 
3.88 
5.63 


2.09 
2.61 
3.80 
5.46 


4 
2.18 
2.76 
4.18 
6.49 


2.17 
2.74 
4.14 
6.41 


2.17 
2.73 
4.11 
6.33 


2.16 
2.71 
4.07 
6.25 


2.15 
2.70 
4.04 
6.19 


2.14 
2.69 
4.02 
6.12 


2.09 
2.61 
3.83 
5.70 


2.06 
2.56 
3.72
5.46 


2.04 
2.53 
3.65 
5.31 


2.00 
2.46 
3.51 
5.02 


1.97 
2.42 
3.41 
4.81 


1.95 
2.38 
3.34 
4.65 


3.30 


9 
1.89 
2.28 
3.22
4.71 


1.88 
2.27 
3.18 
4.64 


1.87 
2.25 
3.15 
4.57 


1.87 
2.24 
3.12 
4.50 


1.86 
2.22 
3.09 
4.45 


1.85 
2.21 
3.07 
4.39 


1.79
2.12 
2.89 
4.02 


1.76 
2.07 
2.78 
3.82 


1.74 
2.04 
2.72 
3.69 


1.69 
1.97 
2.59 
3.44 


1.66 
1.93 
2.50 
3.26 


1.64 
1.89 
2.43 
3.13 
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Table A.8 (continued) 


1.72 
2.01 
2.70 
3.79 


1.71
1.99
2.66
3.72


1.70
1.97
2.63
3.66


1.69
1.96
2.60
3.60


1.68
1.94
2.57
3.54


1.67
1.93
2.55
3.49


1.61
1.84
2.37
3.14


1.57
1.78
2.27
2.95


1.54
1.75
2.20
2.83


1.49
1.68
2.07
2.59


1.46
1.62
1.97
2.42


1.43
1.58
1.90
2.30


2.14 


ν₁ = numerator df
30 

1.66
1.92
2.54
3.52


1.65
1.90
2.50
3.44


1.64
1.88
2.47
3.38


1.63
1.87
2.44
3.32


1.62
1.85
2.41
3.27


1.61
1.84
2.39
3.22


1.54
1.74
2.20
2.87


1.50
1.69
2.10
2.68


1.48
1.65
2.03
2.55


1.42
1.57
1.89
2.32


1.38
1.52
1.79
2.15


1.35
1.47
1.72
2.02


40 


1.63
1.87
2.45
3.37


1.61
1.85
2.42
3.30


1.60
1.84
2.38
3.23


1.59
1.82
2.35
3.18


1.58
1.81
2.33
3.12


1.57
1.79
2.30
3.07


1.51
1.69
2.11
2.73


1.46
1.63
2.01
2.53


1.44
1.59
1.94
2.41


1.38
1.52
1.80
2.17


1.34
1.46
1.69
2.00


1.30
1.41
1.61
1.87


1.61 
1.84 
2.40 
3.28 


1.59 
1.82 
2.36 
3.21 


1.58 
1.81 
2.33 
3.14 


1.57 
1.79 
2.30 
3.09 


1.56 
1.77 
2.27 
3.03 


1.55 
1.76 
2.25 
2.98 


1.48 
1.66 
2.06 
2.64 


1.44 
1.60 
1.95 
2.44 


1.41 
1.56 
1.88 
2.32 


1.35 
1.48 
1.74 
2.08 


1.31 
1.41 
1.63 
1.90 


1.27 
1.36 
1.54 
1.77 


120 


1.56 
1.77
2.27 
3.06 


1.54 
1.75
2.23 
2.99 


1.53
1.73
2.20 
2.92 


1.52 
1.71 
2.17 
2.86 


1.51 
1.70 
2.14 
2.81 


1.50 
1.68 
2.11 
2.76 


1.42 
1.58 
1.92 
2.41 


1.38 
1.51 
1.80 
2.21 


1.35 
1.47 
1.73 
2.08 


1.28 
1.38 
1.57
1.83 


1.23 
1.30 
1.45 
1.64 


1.18 
1.24 
1.35 
1.49 
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1000 


1.52
1.72 
2.18 
2.91 


1.51 
1.70 
2.14 
2.84 


1.50 
1.68 
2.11 
2.78 


1.48 
1.66 
2.08 
2.72 


1.47 
1.65 
2.05 
2.66 


1.46 
1.63 
2.02 
2.61 


1.38 
1.52 
1.82 
2.25 


1.33 
1.45 
1.70 
2.05 


1.30 
1.40 
1.62 
1.92 


1.22 
1.30 
1.45 
1.64 


1.16 
1.21 
1.30 
1.43 


1.08 
1.11 
1.16 
1.22
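One way to sanity-check the ν₁ = 1 column of the F table is the standard identity F(α; 1, ν₂) = t(α/2; ν₂)², since the square of a t variable with ν₂ df has an F distribution with 1 and ν₂ df. The sketch below hard-codes a few entries copied from Tables A.6 and A.8; the variable names are illustrative only, and agreement is expected only up to the rounding of the tabled values.

```python
# (denominator df, t critical value at .025 from Table A.6,
#  F critical value at .05 with numerator df 1 from Table A.8)
pairs = [
    (5, 2.571, 6.61),
    (8, 2.306, 5.32),
    (10, 2.228, 4.96),
    (12, 2.179, 4.75),
]

# Largest disagreement between t squared and the tabled F value
max_gap = max(abs(t * t - f) for _, t, f in pairs)

for nu2, t, f in pairs:
    print(nu2, round(t * t, 3), f)
print("largest gap:", round(max_gap, 4))
```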
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Table A.9 Critical values for studentized range distributions 


v 


3.64 
5.70 
3.46 
5.24 
3.34 
4.95 
3.26 
4.75 
3.20 
4.60 
3.15 
4.48 
3.11 
4.39 
3.08 
4.32 
3.06 
4.26 
3.03 
4.21 
3.01 
4.17 
3.00 
4.13 
2.98 
4.10 
2.97 
4.07 
2.96 
4.05 
2.95 
4.02 
2.92 
3.96 
2.89 
3.89 
2.86 
3.82 
2.83 
3.76 
2.80 
3.70 
2.77
3.64 


4.60 
6.98 
4.34 
6.33 
4.16 
5.92 
4.04 
5.64 
3.95 
5.43 
3.88 
5.27 
3.82 
5.15
3.77 
5.05 
3.73 
4.96 
3.70 
4.89 
3.67 
4.84 
3.65 
4.79 
3.63 
4.74 
3.61 
4.70 
3.59 
4.67 
3.58 
4.64 
3.53 
4.55 
3.49 
4.45 
3.44 
4.37 
3.40 
4.28 
3.36 
4.20 
3.31 
4.12 


4 
5.22 
7.80 
4.90 
7.03
4.68 
6.54 
4.53 
6.20 
4.41 
5.96 
4.33 
5.77
4.26 
5.62
4.20 
5.50 
4.15 
5.40 
4.11 
5.32 
4.08 
5.25 
4.05 
5.19 
4.02 
5.14 
4.00 
5.09 
3.98 
5.05 
3.96 
5.02 
3.90 
4.91 
3.85 
4.80 
3.79 
4.70 
3.74 
4.59 
3.68 
4.50 
3.63 
4.40 


5 
5.67 
8.42 
5.30
7.56 
5.06 
7.01 
4.89 
6.62 
4.76 
6.35 
4.65 
6.14 
4.57 
5.97 
4.51 
5.84 
4.45 
5.73
4.41 
5.63 
4.37 
5.56 
4.33 
5.49 
4.30 
5.43 
4.28 
5.38 
4.25 
5.33
4.23 
5.29 
4.17 
5.17 
4.10 
5.05 
4.04 
4.93 
3.98 
4.82 
3.92 
4.71 
3.86 
4.60 


m 


6 


6.03 
8.91 
5.63 
7.97 
5.36 
7.37 
5.17 
6.96 
5.02 
6.66 
4.91 
6.43 
4.82 
6.25 
4.75 
6.10 
4.69 
5.98 
4.64 
5.88 
4.59 
5.80 
4.56 
5.72 
4.52 
5.66 
4.49 
5.60 
4.47 
5.55 
4.45 
5.51 
4.37 
5.37 
4.30 
5.24 
4.23 
5.11 
4.16 
4.99 
4.10 
4.87 
4.03 
4.76 


7 


6.33 
9.32 
5.90 
8.32 
5.61 
7.68 
5.40 
7.24 
5.24 
6.91 
5.12 
6.67 
5.03 
6.48 
4.95 
6.32 
4.88 
6.19 
4.83 
6.08 
4.78 
5.99 
4.74 
5.92 
4.70 
5.85 
4.67 
5.79 
4.65 
5.73 
4.62 
5.69 
4.54 
5.54 
4.46 
5.40 
4.39 
5.26 
4.31 
5.13 
4.24 
5.01 
4.17 
4.88 


8 


6.58 
9.67 
6.12 
8.61 
5.82 
7.94 
5.60 
7.47
5.43 
7.13 
5.30 
6.87 
5.20 
6.67 
5.12 
6.51 
5.05 
6.37 
4.99 
6.26 
4.94 
6.16 
4.90 
6.08 
4.86 
6.01 
4.82 
5.94 
4.79 
5.89 
4.77 
5.84 
4.68 
5.69 
4.60 
5.54 
4.52 
5.39 
4.44 
5.25
4.36 
5.12 
4.29 
4.99 


9 
6.80 
9.97 
6.32 
8.87 
6.00 
8.17 
5.77
7.68 
5.59 
7.33 
5.46 
7.05 
5.35 
6.84 
5.27
6.67 
5.19 
6.53 
5.13
6.41 
5.08 
6.31 
5.03 
6.22 
4.99 
6.15 
4.96 
6.08 
4.92 
6.02 
4.90 
5.97
4.81 
5.81 
4.72 
5.65 
4.63 
5.50 
4.55 
5.36 
4.47 
5.21 
4.39 
5.08 


10 
6.99 
10.24 
6.49 
9.10 
6.16 
8.37 
5.92 
7.86 
5.74 
7.49 
5.60 
7.21 
5.49 
6.99 
5.39 
6.81 
5.32 
6.67 
5.25 
6.54 
5.20 
6.44 
5.15 
6.35 
5.11 
6.27 
5.07 
6.20 
5.04 
6.14 
5.01 
6.09 
4.92 
5.92 
4.82 
5.76 
4.73 
5.60 
4.65 
5.45 
4.56 
5.30 
4.47 
5.16 


11 
7.17
10.48 
6.65 
9.30 
6.30 
8.55 
6.05 
8.03 
5.87 
7.65
5.72 
7.36 
5.61 
7.13 
5.51
6.94 
5.43 
6.79 
5.36 
6.66 
5.31 
6.55 
5.26 
6.46 
5.21 
6.38 
5.17
6.31 
5.14 
6.25 
5.11
6.19 
5.01 
6.02 
4.92 
5.85 
4.82 
5.69 
4.73 
5.53
4.64 
5.37
4.55 
5.23 




12 


7.32 
10.70 
6.79 
9.48 
6.43 
8.71 
6.18 
8.18 
5.98 
7.78 
5.83 
7.49 
5.71 
7.25 
5.61 
7.06 
5.53
6.90 
5.46 
6.77 
5.40 
6.66 
5.35
6.56 
5.31 
6.48 
5.27 
6.41 
5.23 
6.34 
5.20 
6.28 
5.10 
6.11 
5.00 
5.93 
4.90 
5.76 
4.81 
5.60 
4.71 
5.44 
4.62 
5.29 
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Table A.10 Chi-squared curve tail areas 


Upper-tail area ν=1 ν=2 ν=3 ν=4 ν=5
>.100 <2.70 <4.60 <6.25 <7.77 <9.23 
.100 2.70 4.60 6.25 7.77 9.23
.095 2.78 4.70 6.36 7.90 9.37 
.090 2.87 4.81 6.49 8.04 9.52 
.085 2.96 4.93 6.62 8.18 9.67 
.080 3.06 5.05 6.75 8.33 9.83 
.075 3.17 5.18 6.90 8.49 10.00 
.070 3.28 5.31 7.06 8.66 10.19 
.065 3.40 5.46 7.22 8.84 10.38 
.060 3.53 5.62 7.40 9.04 10.59 
.055 3.68 5.80 7.60 9.25 10.82 
.050 3.84 5.99 7.81 9.48 11.07 
.045 4.01 6.20 8.04 9.74 11.34
.040 4.21 6.43 8.31 10.02 11.64 
.035 4.44 6.70 8.60 10.34 11.98 
.030 4.70 7.01 8.94 10.71 12.37 
.025 5.02 7.37 9.34 11.14 12.83
.020 5.41 7.82 9.83 11.66 13.38 
.015 5.91 8.39 10.46 12.33 14.09 
.010 6.63 9.21 11.34 13.27 15.08 
.005 7.87 10.59 12.83 14.86 16.74 
.001 10.82 13.81 16.26 18.46 20.51 
<.001 >10.82 >13.81 >16.26 >18.46 >20.51 
Upper-tail area ν=6 ν=7 ν=8 ν=9 ν=10
>.100 <10.64 <12.01 <13.36 <14.68 <15.98 
.100 10.64 12.01 13.36 14.68 15.98 
.095 10.79 12.17 13.52 14.85 16.16
.090 10.94 12.33 13.69 15.03 16.35 
.085 11.11 12.50 13.87 15.22 16.54 
.080 11.28 12.69 14.06 15.42 16.75 
.075 11.46 12.88 14.26 15.63 16.97 
.070 11.65 13.08 14.48 15.85 17.20 
.065 11.86 13.30 14.71 16.09 17.44 
.060 12.08 13.53 14.95 16.34 17.71 
.055 12.33 13.79 15.22 16.62 17.99 
.050 12.59 14.06 15.50 16.91 18.30 
.045 12.87 14.36 15.82 17.24 18.64 
.040 13.19 14.70 16.17 17.60 19.02 
.035 13.55 15.07 16.56 18.01 19.44 
.030 13.96 15.50 17.01 18.47 19.92 
.025 14.44 16.01 17.53 19.02 20.48 
.020 15.03 16.62 18.16 19.67 21.16 
.015 15.77 17.39 18.97 20.51 22.02 
.010 16.81 18.47 20.09 21.66 23.20 
.005 18.54 20.27 21.95 23.58 25.18 
.001 22.45 24.32 26.12 27.87 29.58
<.001 >22.45 >24.32 >26.12 >27.87 >29.58 


(continued) 
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Table A.10 (continued) 


Upper-tail area ν=11
>.100 <17.27
.100 17.27
.095 17.45
.090 17.65
.085 17.85
.080 18.06
.075 18.29
.070 18.53
.065 18.78
.060 19.06
.055 19.35
.050 19.67
.045 20.02
.040 20.41
.035 20.84
.030 21.34
.025 21.92
.020 22.61
.015 23.50
.010 24.72
.005 26.75
.001 31.26
<.001 >31.26
Upper-tail area ν=16
>.100 <23.54
.100 23.54
.095 23.75
.090 23.97
.085 24.21
.080 24.45
.075 24.71
.070 24.99
.065 25.28
.060 25.59
.055 25.93
.050 26.29
.045 26.69
.040 27.13
.035 27.62
.030 28.19
.025 28.84
.020 29.63
.015 30.62
.010 32.00
.005 34.26
.001 39.25
<.001 >39.25
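The chi-squared tail areas tabled above can be reproduced without a statistics package. The sketch below starts from the closed forms for ν = 1 and ν = 2 and steps up two degrees of freedom at a time with the incomplete-gamma recurrence Q(a+1, y) = Q(a, y) + y^a e^(−y) / Γ(a+1). The function name `chi2_upper_tail` is hypothetical, and x is assumed to be positive.

```python
import math


def chi2_upper_tail(x: float, nu: int) -> float:
    """Area to the right of x (x > 0) under the chi-squared curve with nu df."""
    y = x / 2.0
    if nu % 2 == 0:
        a, q = 1.0, math.exp(-y)               # nu = 2: exponential tail
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # nu = 1: two-sided normal tail
    # Each pass adds two degrees of freedom (a increases by 1)
    while 2 * a < nu:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1))
        a += 1.0
    return q


print(round(chi2_upper_tail(5.99, 2), 3))    # tabled area .050
print(round(chi2_upper_tail(19.67, 11), 3))  # tabled area .050
print(round(chi2_upper_tail(32.00, 16), 3))  # tabled area .010
```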
Table A.11 Critical values for the Wilcoxon signed-rank test


$P_0(S_+ \ge c_1) = P(S_+ \ge c_1 \text{ when } H_0 \text{ is true})$


n c1 P0(S+ ≥ c1)
3 6 .125
4 9 .125
  10 .062
5 13 .094
  14 .062
  15 .031
6 17 .109
  19 .047
  20 .031
  21 .016
7 22 .109
  24 .055
  26 .023
  28 .008
8 28 .098
  30 .055
  32 .027
  34 .012
  35 .008
  36 .004
9 34 .102
  37 .049
  39 .027
  42 .010
  44 .004
10 41 .097
  44 .053
  47 .024
  50 .010
  52 .005
11 48 .103
  52 .051
  55 .027
  59 .009
  61 .005
12 56 .102
  60 .055
  61 .046
  64 .026
  68 .010
  71 .005
13 64 .108
  65 .095
  69 .055
  70 .047
  74 .024
  78 .011
  79 .009
  81 .005
14 73 .108
  74 .097
  79 .052
  84 .025
  89 .010
  92 .005
15 83 .104
  84 .094
  89 .053
  90 .047
  95 .024
  100 .011
  101 .009
  104 .005
16 93 .106
  94 .096
  100 .052
  106 .025
  112 .011
  113 .009
  116 .005
17 104 .103
  105 .095
  112 .049
  118 .025
  125 .010
  129 .005
18 116 .098
  124 .049
  131 .024
  138 .010
  143 .005
19 128 .098
  136 .052
  137 .048
  144 .025
  152 .010
  157 .005
20 140 .101
  150 .049
  158 .024
  167 .010
  172 .005
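The probabilities in this table follow from the exact null distribution of S+: when the null hypothesis is true, each rank 1, ..., n is counted in S+ independently with probability 1/2, so P0(S+ ≥ c1) is a count of subsets divided by 2^n. A minimal sketch of that count by dynamic programming (the function name is hypothetical):

```python
def signed_rank_upper_tail(n: int, c: int) -> float:
    """P0(S+ >= c) for the Wilcoxon signed-rank statistic with sample size n."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks processed so far that sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # go downward so each rank is used at most once per subset
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[c:]) / 2 ** n


print(signed_rank_upper_tail(3, 6))              # n = 3, c1 = 6
print(round(signed_rank_upper_tail(10, 44), 3))  # n = 10, c1 = 44
```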
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Table A.12 Critical values for the Wilcoxon signed-rank interval 


n Confidence level (%) c
5 93.8 15
  87.5 14
6 96.9 21
  93.7 20
  90.6 19
7 98.4 28
  95.3 26
  89.1 24
8 99.2 36
  94.5 32
  89.1 30
9 99.2 44
  94.5 39
  90.2 37
10 99.0 52
  95.1 47
  89.5 44
11 99.0 61
  94.6 55
  89.8 52
12 99.1 71
  94.8 64
  90.8 61


n 


17 


18 


19 


Confidence level (%) 
99.0 
95.2 
90.6 
99.1 
95.1 
89.6 
99.0 
95.2 
90.5 
99.1 
94.9 
89.5 
99.1 
94.9 
90.2 
99.0 
95.2 
90.1 
99.1 
95.1 
90.4 


21 


22 


23 


24 


25 
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(X(nn$1)/2)—c+1 Xo) 


Confidence level (%) 
99.1 
95.2 
90.3 
99.0 
95.0 
89.7 
99.0 
95.0 
90.2 
99.0 
95.2 
90.2 
99.0 
95.1 
89.9 
99.0 
95.2 
89.9 


c 
173 
158 
150 
188 
172 
163 
204 
187 
178 
221 
203 
193 
239 
219 
208 
257 
236 
224 
Table A.13 Critical values for the Wilcoxon rank-sum test


P0(W ≥ c) = P(W ≥ c when H0 is true)


m   n   c    P0(W ≥ c)      m   n   c    P0(W ≥ c)
3   3   15   .05                    40   .004
    4   17   .057               6   40   .041
        18   .029                   41   .026
    5   20   .036                   43   .009
        21   .018                   44   .004
    6   22   .048               7   43   .053
        23   .024                   45   .024
        24   .012                   47   .009
    7   24   .058                   48   .005
        26   .017               8   47   .047
        27   .008                   49   .023
    8   27   .042                   51   .009
        28   .024                   52   .005
        29   .012           6   6   50   .047
        30   .006                   52   .021
4   4   24   .057                   54   .008
        25   .029                   55   .004
        26   .014               7   54   .051
    5   27   .056                   56   .026
        28   .032                   58   .011
        29   .016                   60   .004
        30   .008               8   58   .054
    6   30   .057                   61   .021
        32   .019                   63   .01
        33   .010                   65   .004
        34   .005           7   7   66   .049
    7   33   .055                   68   .027
        35   .021                   71   .009
        36   .012                   72   .006
        37   .006               8   71   .047
    8   36   .055                   73   .027
        38   .024                   76   .01
        40   .008                   78   .005
        41   .004           8   8   84   .052
5   5   36   .048                   87   .025
        37   .028                   90   .01
        39   .008                   92   .005
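For small samples the entries of Table A.13 can be reproduced by enumeration: when H0 is true and there are no ties, every choice of m ranks out of 1, ..., m + n is equally likely to be the set of ranks belonging to the first sample. The Python sketch below assumes that W is the sum of the ranks of the sample of size m; the function name rank_sum_tail is our own.

```python
from fractions import Fraction
from itertools import combinations


def rank_sum_tail(m: int, n: int, c: int) -> Fraction:
    """Exact P0(W >= c), where W is the rank sum of the sample of size m."""
    ranks = range(1, m + n + 1)
    # every m-subset of the combined ranks is equally likely under H0
    rank_sets = list(combinations(ranks, m))
    hits = sum(1 for chosen in rank_sets if sum(chosen) >= c)
    return Fraction(hits, len(rank_sets))


# Example: m = 3, n = 4, c = 17 (the table lists .057)
print(float(rank_sum_tail(3, 4, 17)))
```

Brute-force enumeration is adequate for the sample sizes in the table (at most 12,870 subsets when m = n = 8), though a recursion would be preferable for larger samples.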
Table A.14 Critical values for the Wilcoxon rank-sum interval


(dij(mn − c + 1), dij(c))


Smaller sample size 


Larger sample size 5 6 7 8 
Confidence level (%) c Confidence level (%) c Confidence level (%) c Confidence level (%) c 
5 99.2 25 
94.4 22 
90.5 21 
6 99.1 29 99.1 34 
94.8 26 95.9 31 
91.8 25 90.7 29 
7 99.0 33 99.2 39 98.9 44 
95.2 30 94.9 35 94.7 40
89.4 28 89.9 33 90.3 38 
8 98.9 37 99.2 44 99.1 50 99.0 56 
95.5 34 95.7 40 94.6 45 95.0 51 
90.7 32 89.2 37 90.6 43 89.5 48 
9 98.8 41 99.2 49 99.2 56 98.9 62 
95.8 38 95.0 44 94.5 50 95.4 57
88.8 35 91.2 42 90.9 48 90.7 54 
10 99.2 46 98.9 53 99.0 61 99.1 69 
94.5 41 94.4 48 94.5 55 94.5 62 
90.1 39 90.7 46 89.1 52 89.9 59 
11 99.1 50 99.0 58 98.9 66 99.1 75 
94.8 45 95.2 53 95.6 61 94.9 68 
91.0 43 90.2 50 89.6 57 90.9 65 
12 99.1 54 99.0 63 99.0 72 99.0 81 
95.2 49 94.7 57 95.5 66 95.3 74 
89.6 46 89.8 54 90.0 62 90.2 70 
Smaller sample size 
Larger sample size 9 10 11 12 
Confidence level (%) c Confidence level (%) c Confidence level (%) c Confidence level (%) c 
9 98.9 69 
95.0 63 
90.6 60 
10 99.0 76 99.1 84 
94.7 69 94.8 76 
90.5 66 89.5 72 
11 99.0 83 99.0 91 98.9 99 
95.4 76 94.9 83 95.3 91 
90.5 72 90.1 79 89.9 86 
12 99.1 90 99.1 99 99.1 108 99.0 116 
95.1 82 95.0 90 94.9 98 94.8 106
90.5 78 90.7 86 89.6 93 89.9 101


Chapter 1 


1. a. Houston Chronicle, Des Moines Register, Chicago Tribune, Washington Post

b. Capital One, Campbell Soup, Merrill Lynch, Pulitzer

c. Bill Jasper, Kay Reinke, Helen Ford, David Menendez

d. 1.78, 2.44, 3.50, 3.04


3. a. In a sample of 100 phones, what are the chances that more than 20 need
service while under warranty? What are the chances that none need service
while still under warranty?

b. What proportion of all phones of this brand and model will need service
within the warranty period?


5. a. Two variables (at least) were recorded: skin color and hourly wages.

b. Skin color is categorical (with four categories), while hourly wages is
quantitative (units: $/h).


7. a. categorical b. quantitative c. categorical d. categorical e. categorical


9. a. No, the relevant conceptual population is all scores of all students who
participate in the SI in conjunction with this particular statistics course.

b. The advantage to randomly assigning students to the two groups is that the
two groups should then be fairly comparable before the study. If the two
groups perform differently in the class, we can reasonably attribute this to
the treatments (SI and control). If it were left to students to choose,
stronger or more dedicated students might gravitate toward SI, confounding
the results.

c. If all students were put in the treatment group there would be no results
with which to compare the treatments.


11. One could generate a simple random sample of all single-family homes in
the city or a stratified random sample by taking a simple random sample from
each of the 10 district neighborhoods. From each of the homes in the sample
the necessary variables would be collected. This would be an enumerative
study because there exists a finite, identifiable population of objects from
which to sample.


13. a. There could be several explanations for the variability of the
measurements. Among them could be measuring error (due to mechanical or
technical changes across measurements), recording error, differences in
weather conditions at time of measurements, etc.

b. This could be called conceptual because there is no sampling frame.


15. This display brings out the gap in the data: There are no scores in the
high 70s.


© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2021

J. L. Devore et al., Modern Mathematical Statistics with Applications,
Springer Texts in Statistics, https://doi.org/10.1007/978-3-030-55156-8
Answers to Odd-Numbered Exercises


17. a. 


19. 


6L | 034
6H | 667899
7L | 00122244
7H |                  Stem = tens
8L | 001111122344     Leaf = ones
8H | 5557899
9L | 03
9H | 58


123333333444444 
55555667788888999 
0000001111224 
5789 

0112 

6 

334 

7 


Stem: tens digit 
Leaf: ones digit 


APAYANADANANFHWWNNKFK COO 
i) 


. Arguably, a representative crack depth might be around 9-10 μm.

This is somewhat subjective, but the display appears quite spread out.

No, the distribution is certainly not symmetric. Rather, crack depths appear
to be strongly positively skewed.


. Yes: All of the values 66.5, 76.1, and 81.1 μm appear to be high outliers.
(Using an outlier convention described later in the chapter, even the values
in the 50s would be considered outliers!)


American French 
8 ]1 
755543211000 9 {00234566 
9432 10 |2356 
6630 11 |1369 
850 12 |223558 
8 lb aa a 
14 
15 |8 
2 16 Stem:tens digit 
Leaf: ones digit 


21. 


23. 
The American distribution is positively
skewed, but the French distribution is 
fairly symmetric. Almost half of the 
American movies are in the 90s, but the 
French movies are more spread out. 


Value   Freq.   Rel. Freq. (= Freq./60)
0       7       .117
1       12      .200
2       13      .217
3       14      .233
4       6       .100
5       3       .050
6       3       .050
7       1       .017
8       1       .017


Note Relative frequencies add to 1.001, not 1, 
due to rounding. 


b. The number of batches with at most 5 


nonconforming items is 7 + 12 + 13 + 14 
+ 6 + 3 = 55, which is a proportion of 
55/60 = .917. The proportion of batches 
with (strictly) fewer than 5 noncon- 
forming items is 52/60 = .867. 

Notice that these proportions could also 
have been computed by using the relative 
frequencies: e.g., proportion of batches 
with 5 or fewer nonconforming items = 
1 —(.05 + .017 + .017) = .916; proportion 
of batches with fewer than 5 noncon- 
forming items = 1 - 
(.05 + .05 + .017 + .017) = .866. 
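The two proportions above can be recomputed directly from the frequency table. The short Python sketch below uses the frequencies listed in the table (60 batches in total); the variable names are our own. Working from the raw counts avoids the small rounding discrepancies (.916 versus .917) that appear when the rounded relative frequencies are added.

```python
# value -> number of batches with that many nonconforming items
freq = {0: 7, 1: 12, 2: 13, 3: 14, 4: 6, 5: 3, 6: 3, 7: 1, 8: 1}
n = sum(freq.values())  # total number of batches

# proportion with at most 5 nonconforming items
at_most_5 = sum(count for value, count in freq.items() if value <= 5) / n
# proportion with strictly fewer than 5 nonconforming items
fewer_than_5 = sum(count for value, count in freq.items() if value < 5) / n

print(n, round(at_most_5, 3), round(fewer_than_5, 3))
```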


. The center of the histogram is some- 


where around 2 or 3 and it shows that 
there is some positive skewness in the 
data. The histogram also shows that there 
is a lot of spread/variation in this data. 


a. 589/1570 = .375.

b. 1 − (589 + 190 + 176 + 157 + 115)/1570 = .218.

c. (115 + 89 + 57 + 55 + 33 + 31)/1570 = .242.
25.


ene 


29. 


31. 


a. 


a. 


b. 


a. 


. The herd size distribution in the accompanying histogram is extremely
positively skewed.


From a histogram, the proportion of subdivisions having no cul-de-sacs
(i.e., y = 0) is 17/47 = .362, or 36.2%. The proportion having at least one
cul-de-sac (y ≥ 1) is (47 − 17)/47 = 30/47 = .638, or 63.8%. Note that
subtracting the number of subdivisions with y = 0 from the total, 47, is an
easy way to find the number of subdivisions with y ≥ 1.


. From a histogram, the proportion of subdivisions with at most 5
intersections (i.e., z ≤ 5) is 42/47 = .894, or 89.4%. The proportion having
fewer than 5 intersections (z < 5) is 39/47 = .830, or 83.0%.


The distribution of these by-state values 
is slightly positively skewed with one 
extremely high outlier (Washington DC, 
54.6%) and two other potential outliers 
(Massachusetts, 40.5% and West Virginia, 
19.2%). The “typical” state percentage 
appears to be between 25 and 30%. 


. No: Since the population sizes of the 50 


states + DC are not equal, the mean of 
these percentages would not equal the 
overall percentage. (If we knew all 51 
population sizes, we could take the 
appropriate weighted average, effectively 
re-constructing the total count of people 
with 4-year degrees and dividing by the 
total population size.) 
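The weighted-average remark can be illustrated with a small Python sketch. The two percentages and population sizes below are invented purely for illustration and are not data from the exercise; the function name overall_percentage is also our own.

```python
def overall_percentage(percents, populations):
    """Population-weighted average of group percentages."""
    total = sum(populations)
    return sum(p * size for p, size in zip(percents, populations)) / total


# Hypothetical example: a large group at 20% and a small group at 40%
percents = [20.0, 40.0]
populations = [9_000_000, 1_000_000]

simple_mean = sum(percents) / len(percents)           # treats groups equally
weighted = overall_percentage(percents, populations)  # matches the pooled count

print(simple_mean, weighted)
```

In this made-up case the unweighted mean is 30% while the overall percentage is only 22%, because the larger group dominates the pooled count.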


The transformation substantially changes 
the shape of the histogram. In particular, 
while the original variable x = number of 
defects was strongly positively skewed 
with an outlier, log10(x) is reasonably
symmetrically distributed with no outlier. 


7% of 464 students is roughly (.07)(464) 
= 32.48, or 32 students. [32/464 = .069, 
which rounds to .07.] 


. 18% + 6% + 5% = 29%. 
. No. Without an upper bound on the last 


category, we can’t even make a density 
histogram of the data, because we don’t 
know where the last rectangle should end. 




33. a. The distribution is skewed to the right, or positively skewed. There
is a gap in the histogram, and what appears to be an outlier in the
500 -< 550 interval.

Class interval   Frequency   Relative frequency
0 -< 50          9           0.18
50 -< 100        19          0.38
100 -< 150       11          0.22
150 -< 200       4           0.08
200 -< 250       2           0.04
250 -< 300       2           0.04
300 -< 350       1           0.02
350 -< 400       1           0.02
400 -< 450       0           0.00
450 -< 500       0           0.00
500 -< 550       1           0.02
                 50          1.00

b. The distribution of the natural logs of the original data is much more
symmetric than the original.

Class interval   Frequency   Relative frequency
2.25 -< 2.75     2           0.04
2.75 -< 3.25     2           0.04
3.25 -< 3.75     3           0.06
3.75 -< 4.25     8           0.16
4.25 -< 4.75     18          0.36
4.75 -< 5.25     10          0.20
5.25 -< 5.75     4           0.08
5.75 -< 6.25     3           0.06

c. The proportion of lifetime observations in this sample that are less than
100 is .18 + .38 = .56, and the proportion that are at least 200 is
.04 + .04 + .02 + .02 + .02 = .14.


35. a. The variable here is helmet status, a categorical variable. Its
possible values are no helmet, noncompliant helmet, and compliant helmet.

Category              Frequency   Relative frequency
No helmet             731         .43
Noncompliant helmet   153         .09
Compliant helmet      816         .48
Total                 1700        1.00

c. .09 + .48 = .57.




39. 


41. 


43. 


a. 


a. 


The relative frequency distribution is as 
follows. The relative frequency distribu- 
tion is almost unimodal and exhibits a 
large positive skew. The typical middle 
value is somewhere between 400 and 450, 
although the skewness makes it difficult to 
pinpoint more exactly than this. 


Class          Rel. Freq.     Class            Rel. Freq.
0 -< 150       .193           1050 -< 1200     .029
150 -< 300     .183           1200 -< 1350     .005
300 -< 450     .251           1350 -< 1500     .004
450 -< 600     .148           1500 -< 1650     .001
600 -< 750     .097           1650 -< 1800     .002
750 -< 900     .066           1800 -< 1950     .002
900 -< 1050    .019


. The proportion of the fire loads less than 600 is
.193 + .183 + .251 + .148 = .775 (the cumulative proportion for 600). The
proportion of loads that are at least 1200 is
.005 + .004 + .001 + .002 + .002 = .014 (the complement of the cumulative
proportion for 1200).


. The proportion of loads between 600 and 1200 is 1 − .775 − .014 = .211.


.¥= (5424 ---+5+40)/10 = 3.5 yd. 
. The two middle values in order are 2 and 2, so x̃ = 2 yards. Todd Gurley’s
mean rushing gain is artificially increased by the one 16-yard gain, while
the median ignores this extreme value.


. Deleting the 16-yard gain and the 1-yard loss (−1) amounts to trimming 1/10
of the observations from each end. So, we’re talking about the 10% trimmed
mean, and the average of the remaining 8 values is x̄tr(10) = 2.5 yards. As is
typically the case, the trimmed mean falls between the median (2 yards) and
the mean (3.5 yards).


With the one very high outlier (Wall 
Street Journal at over 2.2 million), we 
anticipate that the mean will be higher
than the median. 

b. x̄ = (1/20)(2237601 + ⋯ + 196286) = 403,456. In order, the middle two
values are 285,129 and 276,445, so x̃ = (1/2)(285129 + 276445) = 280,787.
Sure enough, the median circulation for the top 20 newspapers is
substantially less than the mean, due to the one extremely high outlier.


45. Using software, x̃ = 92, x̄tr(25) = 95.07, x̄tr(10) = 102.23, x̄ = 119.3.
The mean is somewhat larger because of positive skewness. Trimming results in
a value between the mean and median, and additional trimming gives a value
closer to the median.


47. a. The reported values are (in increasing 
order) 110, 115, 120, 120, 125, 130, 130, 
135, and 140. Thus the median of the 
reported values is 125. 
b. 127.6 is reported as 130, so the median is 
now 130, a very substantial change. 
When there is rounding or grouping, the 
median can be highly sensitive to small 
change. 


49. The mean cannot be calculated, because we 
need the exact value of the two 100+ obser- 
vations. We can, however, compute median = 
(57 + 79) /2 = 68.0, 20% trimmed mean = 
66.2, 30% trimmed mean = 67.5. 


51. a. Manufacturer is a categorical variable. 

b. Since Honda is the most frequent man- 
ufacturer, arguably Honda is the most 
representative “value” of this categorical 
variable. 

c. No. Any numerical coding of these six 
categories artificially imposes an order 
on the manufacturers. For instance, 
sorting alphabetically and sorting by 
popularity would result in different cod- 
ings and thus different means and medi- 
ans. Only the mode (i.e., part b) makes 
sense as a representative value. 


53. a. range = 49.3 − 23.5 = 25.8.

b. Σxi = 310.3, x̄ = 31.03, Sxx = Σ(xi − x̄)² = 443.801, Σxi² = 10,072.41,
s² = Sxx/(n − 1) = 443.801/9 = 49.3112.

c. s = √49.3112 = 7.022.

d. s² = [Σxi² − (Σxi)²/n]/(n − 1) = [10,072.41 − (310.3)²/10]/9 = 49.3112.
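The computational (shortcut) formula for the sample variance can be verified numerically from the two sums quoted in this answer. The Python sketch below uses only those summary values, since the raw observations are not reproduced here; the variable names are our own.

```python
import math

n = 10
sum_x = 310.3        # sum of the observations
sum_x2 = 10072.41    # sum of the squared observations

# Shortcut formula: Sxx = sum(x^2) - (sum(x))^2 / n
s_xx = sum_x2 - sum_x ** 2 / n
s2 = s_xx / (n - 1)   # sample variance
s = math.sqrt(s2)     # sample standard deviation

print(round(s_xx, 3), round(s2, 4), round(s, 3))
```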

55. a. x̄ = Σxi/n = 14438/5 = 2887.6. The sorted values are: 2781 2856 2888
2900 3013, so the sample median is x̃ = 2888.

b. Subtracting a constant from each observation shifts the data, but does not
change its sample variance. For example, by subtracting 2700 from each
observation we get the values 81, 200, 313, 156, and 188, which are smaller
(fewer digits) and easier to work with by hand. The sum of squares of this
transformed data is 204210 and its sum is 938, so the computational formula
for the variance gives

s² = [204210 − (938)²/5]/(5 − 1) = 7060.3.
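The shift-invariance claim in part (b) is easy to confirm with Python's standard library. The sketch below uses the five sorted values listed in part (a); the variable names are our own.

```python
from statistics import variance

data = [2781, 2856, 2888, 2900, 3013]
shifted = [x - 2700 for x in data]   # 81, 156, 188, 200, 313

# The sample variance (divisor n - 1) is unchanged by subtracting a constant.
var_original = variance(data)
var_shifted = variance(shifted)

print(var_original, var_shifted)
```

Both calls should return 7060.3, the value obtained above with the computational formula.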


57. s = 24.4. In general, the size of a typical 
deviation from the sample mean (370.7 s) is 
about 24.4 s. Some observations may deviate 
from 370.7 by a little more than this, some by 
less. 


59. $1,961,160 


61. —3.5. One sample for which these are the 
deviations is 3.8, 4.4, 4.5, 4.8, and 0. 


63. a. q1 = 149.5, q3 = 1175, iqr = 1175 − 149.5 = 1025.5

b. A high outlier is anything exceeding q3 + 1.5iqr = 1175 + 1.5(1025.5) =
2713.25, and an extremely high outlier is anything over q3 + 3iqr =
1175 + 3(1025.5) = 4251.5.

c. A boxplot shows a positively skewed award distribution, with a median
award of $750 thousand and no apparent outliers.
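The outlier cutoffs in this answer follow the usual 1.5 iqr and 3 iqr convention, which can be wrapped in a small helper. The Python sketch below is only an illustration; the function name outlier_fences and the dictionary keys are our own.

```python
def outlier_fences(q1: float, q3: float) -> dict:
    """Mild (1.5 iqr) and extreme (3 iqr) outlier cutoffs from the quartiles."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "mild_low": q1 - 1.5 * iqr,
        "mild_high": q3 + 1.5 * iqr,
        "extreme_low": q1 - 3 * iqr,
        "extreme_high": q3 + 3 * iqr,
    }


fences = outlier_fences(149.5, 1175)
print(fences["mild_high"], fences["extreme_high"])
```

Any observation above the mild_high cutoff would be flagged as a high outlier, and any above extreme_high as an extreme one.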


65. a. 27.82, 26, 27.38 
b. From software, the quartiles are roughly 23 and 32, so iqr = 9. Mild
outliers are outside 23 − 1.5(9) = 9.5 and 32 + 1.5(9) = 45.5. Extreme
outliers are outside 23 − 3(9) = −4 and 32 + 3(9) = 59. Hence, there is one
low mild outlier and there are three high mild outliers. Note: Depending on
how the quartiles and iqr are calculated, the observation 46 might or might
not be deemed an outlier.


67. The most noticeable feature of the compar- 
ative boxplots is that machine 2’s sample 
values have considerably more variation 
than does machine 1’s sample values. 
However, a typical value, as measured by 
the median, seems to be about the same for 
the two machines. The only outlier that 
exists is from machine 1. 


69. All of the Indian salaries are below the first 
quartile of Yankee salaries. There is much 
more variability in the Yankee salaries. 
Neither team has any outliers. 


71. Outliers occur in the 6 a.m. data. The dis- 
tributions at the other times are fairly sym- 
metric. Variability and the typical values in 
the data increase a little at the 12 noon and 
2 p.m. times. Clearly the 6 a.m. vehicles 
warrant further investigation! 


73. a. 


Males |   | Females
      | 2 | 6        Stem: ones digit
      | 2 |          Leaf: tenths digit
    1 | 3 | 0011
      | 3 | 22
 5444 | 3 | 5
  776 | 3 |
  988 | 3 | 8
   00 | 4 |
      | 4 | 3

b. x̃ = 3.70 cm for males and 3.15 cm for
females.

c. Males’ aortic root diameters are greater, 
on average, than females’ in this sample 
(see the medians above). But the women 
in the sample exhibited much more 
75.


77.


79. 
81. 


83. 


variability in aortic root diameter than 
did the men, including some potential 
high and low outliers. 


There are no outliers in the three data sets. 
However, as a comparative boxplot shows, 
the three data sets differ with respect to their 
central values (the medians are different) 
and the data for flow rate 160 is somewhat 
less variable than the other data sets. Flow 
rates 125 and 200 also exhibit a small 
degree of positive skewness. 


a. HC data: s = 9.59. CO data: s = 59.41. 
Since the CO data are on a much larger 
scale, it makes sense that their standard 
deviation should be larger—standard 
deviation reflects absolute scale. 


b. The mean of the HC data is 96.8/4 = 24.2; the mean of the CO data is
735/4 = 183.75. Therefore, the coefficient of variation of the HC data is
9.59/24.2 = .3963, or 39.63%. The coefficient of variation of the CO data is
59.41/183.75 = .3233, or 32.33%. Thus, even though the CO data has a larger
standard deviation than does the HC data, it actually exhibits less
variability (in percentage terms) around its average than does the HC data.
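The coefficient-of-variation comparison can be reproduced from the summary numbers quoted in this answer. The Python sketch below uses only those standard deviations and totals; the function and variable names are our own.

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """Standard deviation expressed as a fraction of the mean."""
    return sd / mean


cv_hc = coefficient_of_variation(9.59, 96.8 / 4)    # HC data: mean 24.2
cv_co = coefficient_of_variation(59.41, 735 / 4)    # CO data: mean 183.75

print(round(cv_hc, 4), round(cv_co, 4))
```

Because the coefficient of variation is unit-free, it allows the two emissions measures to be compared even though they sit on very different scales.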


10.70; 10.60; 10.65 


The IQ distribution for these 33 children is 
reasonably symmetric, with a mean IQ score 
of 113.7 and a standard deviation of 12.7. 
The sample includes three outliers (using the 
1.5iqr rule): a low outlier at 82 and two high 
outliers at 140 and 146. 


a. The typical radon level in houses where a 
child had cancer seems somewhat higher 
than in no-cancer households. Both dis- 
tributions are positively skewed. Radon 
levels of 55, 55, and 85 Bq/m³ are potential high outliers among the
no-cancer households, while an extreme outlier of 210 Bq/m³ was recorded in
one household with a childhood cancer.


85. 


87. 


89. 


              Cancer |   | No cancer
             9987653 | 0 | 33566777889999
88876665553321111000 | 1 | 11111223477
            73322110 | 2 | 11449999
                9843 | 3 | 389
                   5 | 4 |
                     | 5 | 55
                     | 6 |
                     | 7 |          Stem: tens digit
             HI: 210 | 8 | 5        Leaf: ones digit


b. s = 31.7 Bq/m³ for the cancer households and 17.0 Bq/m³ for the no-cancer
households, suggesting greater variability in the first group. This seemingly
contradicts the graph, where the radon distribution on the left appears more
concentrated than the one on the right.

c. iqr = 11.0 for cancer households and 
18.0 for non-cancer households. Now the 
non-cancer households exhibit greater 
variability in radon levels, which is more 
consistent with our graph. The culprit 
here is presumably the extreme value of 
210, which greatly influences the stan- 
dard deviation of the cancer group but 
has no effect on the iqr of that sample. 


The healthy individuals have higher recep- 
tor binding measure on average than the 
individuals with PTSD. There is also more 
variation in the healthy individuals’ values. 
The distribution of values for the healthy is 
reasonably symmetric, while the distribution 
for the PTSD individuals is negatively 
skewed. 


a. Mode = .93. It occurs four times in the 
data set. 

b. The modal category is the one with the 
highest (relative) frequency. 


The measures that are sensitive to outliers are: the mean and the midrange.
The mean is sensitive because all values are used in computing it. The
midrange is sensitive because it uses only the most extreme values in its
computation. The median, the trimmed mean, and the midquarter are not
sensitive to outliers. The median is the most resistant to outliers because
it uses only the middle value (or values) in its computation. The trimmed
mean is somewhat resistant to outliers. The larger the trimming percentage,
the more resistant the trimmed mean becomes. The midquarter, which uses the
quartiles, is reasonably resistant to outliers because both quartiles are
resistant to outliers.


91. a. sy² = sx² and sy = sx b. sz² = 1 and sz = 1


93. 
95. 


b. 


a. 


= 
552, .102 c. 30 d. 19 


There may be a tendency to a repeating 

pattern. 

. The value .1 gives a much smoother 
series. 

. The smoothed value depends on all pre- 
vious values of the time series, but the 
coefficient decreases with k. 

. As t gets large, the coefficient (1 — a) 

decreases to zero, so there is decreasing 

sensitivity to the initial value. 


Chapter 2 


1. 


a 
b 


Cc. 


a. 


aoe fj 


A ∩ B
. A ∪ B
(A ∩ B′) ∪ (B ∩ A′)


S = {1324, 1342, 1423, 1432, 2314,

2341, 2413, 2431, 3124, 3142, 4123, 

4132, 3214, 3241, 4213, 4231} 

. A= {1324, 1342, 1423, 1432} 

. B = {2314, 2341, 2413, 2431, 3214, 
3241, 4213, 4231} 

. AUB = {1324, 1342, 1423, 1432, 2314, 

2341, 2413, 2431, 3214, 3241, 4213, 

4231} 

A ∩ B = ∅

A’ = {2314, 2341, 2413, 2431, 3124, 

3142, 4123, 4132, 3214, 3241, 4213, 

4231} 


. A= {SSF, SFS, FSS} 

. B= {SSS, SSF, SFS, FSS} 

. C= {SSS, SSF, SFS} 

. C' = {SFF, FSS, FSF, FFS, FFF} 
A ∪ C = {SSS, SSF, SFS, FSS}
A ∩ C = {SSF, SFS}

B ∪ C = {SSS, SSF, SFS, FSS}
B ∩ C = {SSS, SSF, SFS}
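The unions, intersections, and complement listed in this answer can be checked mechanically with Python's built-in set type. The sketch below builds the eight-outcome sample space of S/F triples and uses the events A, B, and C exactly as listed above; the name C_complement is our own.

```python
from itertools import product

# Sample space: all sequences of three outcomes, each S or F
S = {"".join(outcome) for outcome in product("SF", repeat=3)}

A = {"SSF", "SFS", "FSS"}
B = {"SSS", "SSF", "SFS", "FSS"}
C = {"SSS", "SSF", "SFS"}

C_complement = S - C       # C'
print(sorted(C_complement))
print(sorted(A | C))       # A union C
print(sorted(A & C))       # A intersect C
print(sorted(B | C))       # B union C
print(sorted(B & C))       # B intersect C
```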


13. 


15. 


17. 


19. 


. a. 
{111, 112, 113, 121, 122, 123, 131, 132,
133, 211, 212, 213, 221, 222, 223, 231, 
232, 233, 311, 312, 313, 321, 322, 323, 
331, 332, 333} 


b. {111, 222, 333} 


Q 


p- 


. SAS and SPSS are not 


. {123, 132, 213, 231, 312, 321} 
. {111, 113, 131, 133, 311, 313, 331, 333} 


. {BBBAAAA, BBABAAA, BBAABAA,
BBAAABA, BBAAAAB, BABBAAA,
BABABAA, BABAABA, BABAAAB,
BAABBAA, BAABABA, BAABAAB,
BAAABBA, BAAABAB, BAAAABB,
ABBBAAA, ABBABAA, ABBAABA,
ABBAAAB, ABABBAA, ABABABA,
ABABAAB, ABAABBA, ABAABAB,
ABAAABB, AABBBAA, AABBABA,
AABBAAB, AABABBA, AABABAB,
AABAABB, AAABBBA, AAABBAB,
AAABABB, AAAABBB}

. {AAAABBB, AAABABB, AAABBAB, 
AABAABB, AABABAB} 
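Both listings can be generated programmatically. The Python sketch below enumerates every arrangement of four A's and three B's, then keeps those in which A is strictly ahead of B after every position. That "A always ahead" reading of the second event is our assumption, chosen because it reproduces exactly the five outcomes listed; the function names are our own.

```python
from itertools import combinations


def arrangements(n_a: int = 4, n_b: int = 3) -> list:
    """All sequences containing n_a A's and n_b B's."""
    total = n_a + n_b
    result = []
    for b_positions in combinations(range(total), n_b):
        result.append("".join("B" if i in b_positions else "A"
                              for i in range(total)))
    return result


def a_always_ahead(seq: str) -> bool:
    """True if the running count of A's exceeds that of B's at every step."""
    lead = 0
    for ch in seq:
        lead += 1 if ch == "A" else -1
        if lead <= 0:
            return False
    return True


sample_space = arrangements()
ahead = [seq for seq in sample_space if a_always_ahead(seq)]
print(len(sample_space), sorted(ahead))
```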

07 
30 
57 


. They are awarded at least one of the first 


two projects, .36. 


. They are awarded neither of the first two 


projects, .64. 


. They are awarded at least one of the 


projects, .53. 


. They are awarded none of the projects, 


47. 


. They are awarded only the third project, 


17. 


. Either they fail to get the first two or they 


are awarded the third, .75. 


572 
879 


the only 
packages. 

ud 

8 

2 
21.


23. 


25. 


21 


29. 


31. 


33: 


51. 


. a. .0839  b. .2498 


a. .8841 
b. .0435 


a. .10 
b. .18, .19 
c. 41 
d. .59 
e. 31 
f. .69 


a. 1/15 
b. 6/15 
c. 14/15 
d. 8/15 


a. .98 
b. .02 
c. .03 
d. .24 


a. 1/9 
b. 8/9 
c. 2/9 


a. 20 
b. 60 
c. 10 


a. 243 
b. 3645, 10 


. .0679 
. a. 8008 b. 3300 


c. 5236 d..4121, .6538 


.20 


. 0456 


c. .1998 


. IIS, 1/3, 2/3 
. a. 447, 5, 2 


b. P(A|C) = .4, the fraction of ethnic group 
C that has blood type A. 
P(C|A) = .447, the fraction of those with 
blood group A that are of ethnic group C. 
ce. .211 


a. Of those with a Visa card, .5 is the 
fraction who also have a Master Card. 


53. 
55. 
57. 
59. 
65. 
69. 
71. 
73. 


79: 


79. 
81. 
83. 
85. 
87. 


89. 
91. 


93. 
95. 
b. Of those with a Visa card, .5 is the
fraction who do not have a Master Card. 

c. Of those with Master Card, .625 is the 
fraction who also have a Visa Card. 

d. Of those with Master Card, .375 is the 
fraction who do not have a Visa Card. 

e. Of those with at least one of the two 
cards, .769 is the fraction who have a 
Visa card. 


.217, .178 

436, 581 

.0833 

a..102 b. 1 

a. .067  b. .509 
a. .765 b. .2353 
466, .288, .247 


a. BB or Bb, with probability 1/2 each 

b. 4/7 c. 2/3 

a. Because of independence, the condi- 
tional probability is the same as the 
unconditional probability, .3. 

b. .82 

c. .146 


.349, .651, (1 − p)^n, 1 − (1 − p)^n
.99999969, .2262

.9981

b. no 


a. yes 


a. 2p —p” 

Lath py 

. (l-py 

9+ 11 -py 
0137 


ene s 


8588, .9897 
2pl(1 + p) 


a. exact answer = .46 b. se ≈ .005


.8159 (answers will vary) 
97.
99. 
101. 
103. 


105. 
107. 
109. 
111. 
113. 
115. 


117. 
119. 


121. 
123. 
125. 
127. 
129. 


131. 
133. 
137. 


≈ .39, ≈ .88 (answers will vary)
≈ .91 (answers will vary)
≈ .02 (answers will vary)


.37 (answers will vary) 
176,000,000 (answers will vary; 
exact = 176,214,841) 


V2 


a. ≈ .20 b. ≈ .56 (answers will vary)
a. .5177 b. ≈ .4914 (answers will vary)
≈ .2 (answers will vary)

b. m= 4- P(A) 


a. 10,626 b. 255,024 c. 127,512 


a. 1/3, .444 
b. .15 
c. .291 


45, .32 


a. 1/120 
b. 1/5 
c. 1/5 


905 
a. .904  b. .766 
.008 
362, .348, .290 


a. P(G | R1 < R2 < R3) = 2/3, so classify as
granite if R1 < R2 < R3.

b. P(G | R1 < R3 < R2) = .294, so classify as
basalt if R1 < R3 < R2.
P(G | R3 < R1 < R2) = 1/15, so classify as
basalt if R3 < R1 < R2.

c. .175 

d. p> 14/17 


a. 1/24 b. 15/24 c. 1 − e^(−1)
sal 


a. P(B0 | survive) = b0/[1 − (b1 + b2)cd]
P(B1 | survive) = b1(1 − cd)/[1 − (b1 + b2)cd]
P(B2 | survive) = b2(1 − cd)/[1 − (b1 + b2)cd]

b. .712, .058, .231 
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Chapter 3 


1. 


Outcome:  FFF  SFF  FSF  FFS  FSS  SFS  SSF  SSS
x:        0    1    1    1    2    2    2    3


3. M = the absolute value of the difference 
between the outcomes with possible values 


11. 


13. 


0, 


1, 2, 3, 4, 5 or 6; W = 1 if the sum of the 


two resulting numbers is even and W = 0 
otherwise, a Bernoulli random variable. 


No, X can be a Bernoulli random variable 
where a success is an outcome in B, with 
B a particular subset of the sample space. 


a. 


b. 


io) 


mona gs 


Possible values are 0, 1, 2, ..., 12; 
discrete 
With N = # on the list, values are 0, 1, 2, 


..., N; discrete 


. Possible values are 1, 2, 3, 4, ...; discrete 
. {x: 0 < x < ∞} if we assume that a rat-


tlesnake can be arbitrarily short or long; 
not discrete 


. With c = amount earned per book sold, 


possible values are 0, c, 2c, 3c, ..., 
10,000c; discrete 
{y: 0 < y < 14} since 0 is the smallest 
possible pH and 14 is the largest possible 
pH; not discrete 


. With m and M denoting the minimum 


and maximum possible tensions, respec- 
tively, possible values are {x: m <x < 
M}; not discrete 


. Possible values are 3, 6, 9, 12, 15, ... 


i.e. 3(1), 3(2), 3(3), 3(4), ...giving a first 
element, etc.; discrete 


. X is a discrete random variable with 


possible values {2, 4, 6, 8, ...} 


. X is a discrete random variable with 


possible values {2, 3, 4, 5, ...} 


.10 
243,25 


70 
5 
55 
71 
65 
5 
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15. 


17. 


19. 


21. 


23. 


29: 


29. 


31. 


a. (1, 2) (1, 3) (1, 4) (1, 5) (2, 3) (2, 4) (2, 5)
(3, 4) (3, 5) (4, 5)

b. p(0) = .3, p(1) = .6, p(2) = .1, p(x) = 0
otherwise

c. F(0) = .30, F(1) = .90, F(2) = 1. The cdf is

F(x) = 0     for x < 0
F(x) = .30   for 0 ≤ x < 1
F(x) = .90   for 1 ≤ x < 2
F(x) = 1     for 2 ≤ x
a. .81 
b. .162 


c. The fifth battery must be an A, and one of 
the first four must also be an A, so 
p(5) = P(AUUUA or UAUUA or UUAUA 
or UUUAA) = .00324 

d. P(Y = y) = (y − 1)(.1)^(y−2)(.9)², y = 2, 3, 4,
5, ...
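The general formula in part (d) can be checked against the numerical answers in parts (a) through (c). The Python sketch below assumes, as those answers suggest, that each battery is acceptable with probability .9 independently of the others and that Y counts the batteries examined until the second acceptable one is found; the function name pmf_y is our own.

```python
def pmf_y(y: int, p_accept: float = 0.9) -> float:
    """P(Y = y): the second acceptable battery is found on trial y (y >= 2)."""
    # one acceptable battery among the first y - 1, then an acceptable one
    return (y - 1) * (1 - p_accept) ** (y - 2) * p_accept ** 2


for y in (2, 3, 5):
    print(y, round(pmf_y(y), 5))

# the probabilities should add up to (essentially) 1
print(round(sum(pmf_y(y) for y in range(2, 200)), 10))
```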


b. p(1) = .301, p(2) = .176, p(3) = .125,
p(4) = .097, p(5) = .079, p(6) = .067, p(7)
= .058, p(8) = .051, p(9) = .046. Lower
digits (such as 1 and 2) are much more
likely to be the lead digit of a number
than higher digits (such as 8 and 9).

c. F(1) = .301, F(2) = .477, F(3) = .602,
F(4) = .699, F(5) = .778, F(6) = .845,
F(7) = .903, F(8) = .954, F(9) = 1. So,
F(x) = 0 for x < 1; F(x) = .301 for 1 ≤
x < 2; F(x) = .477 for 2 ≤ x < 3; etc.

d. .602, .301 
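The probabilities listed in parts (b) through (d) agree with the leading-digit law p(x) = log10(1 + 1/x) for x = 1, ..., 9, commonly known as Benford's law. The answer key does not restate the formula, so treat it here as an assumption that happens to reproduce the printed values. A short Python sketch:

```python
import math

# Leading-digit pmf, assuming p(x) = log10((x + 1) / x)
pmf = {x: math.log10(1 + 1 / x) for x in range(1, 10)}

# Cumulative distribution function at the integers 1, ..., 9
cdf = {}
running = 0.0
for x in range(1, 10):
    running += pmf[x]
    cdf[x] = running

p_at_most_3 = cdf[3]        # P(X <= 3)
p_at_least_5 = 1 - cdf[4]   # P(X >= 5)

print([round(pmf[x], 3) for x in range(1, 10)])
print(round(p_at_most_3, 3), round(p_at_least_5, 3))
```

The two final values correspond to the pair .602, .301 given in part (d), on the assumption that part (d) asks for P(X ≤ 3) and P(X ≥ 5).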


F(x) =0,x<0;.10,0 < x<1;.25,1<x< 
2; .45,2 < x<3;.70,3 < x<4;.90,4 < 
x <5; .96,5 < x<6;1.00,6 < x 


a. p(1) = .30, p(3) = .10, p(4) = .05, p(6) = 
15, p(12) = .40 
b. .30, .60 


a. p(x) = (1/3)(2/3)^(x−1), x = 1, 2, 3, ...

b. p(y) = (1/3)(2/3)^(y−2), y = 2, 3, 4, ...

c. p(0) = 1/6, p(z) = (25/54)(4/9)^(z−1),
z = 1, 2, 3, ...


a. .60 

b. $110 

a. 16.38, 272.298, 3.9936 
. $458.46 


QO 


io” 


33. 
35. 
37. 


39. 


43. 


45. 


47. 


49. 


51. 


55. 


57. 


59. 


63. 


65. 


c. $33.97 
d. 13.66 
Yes, because Σ(1/x²) is finite.
$700 
Since $142.92 > $100, you expect to win 


more if you gamble. 


a. —$1/19, -$1/19 

b. The expected return for a $1 wager on 
roulette is the same no matter how you bet. 

c. $5.76, $2.76, $1.00 

d. Low-risk/low-reward bets (such as a color) 
have smaller standard deviation than high- 
risk/high-reward bets (such as a single 
number). 


a. 32.5 

b. 7.5 

c. V(X) = E[X(X − 1)] + E(X) − [E(X)]²

a. 1/4, 1/9, 1/16, 1/25, 1/100 

b. μ = 2.64, σ = 1.54, P(|X − μ| ≥ 2σ) = .04
< .25, P(|X − μ| ≥ 3σ) = 0 < 1/9
The actual probability can be far below
the Chebyshev bound, so the bound is
conservative.

c. 1/9, equal to the Chebyshev bound 

d. p(-1) = .02, p(O) = .96, p(1) = .02 
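Chebyshev's inequality states that P(|X − μ| ≥ kσ) ≤ 1/k². The distribution in part (d) appears to be one for which the bound is attained with k = 5; the Python sketch below checks this numerically. The choice k = 5 is our reading of the exercise rather than something stated in the answer key, and the variable names are our own.

```python
import math

pmf = {-1: 0.02, 0: 0.96, 1: 0.02}

mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = math.sqrt(var)

k = 5
# small tolerance guards against floating-point error in k * sigma
tail = sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma - 1e-12)
bound = 1 / k ** 2

print(mu, round(sigma, 4), round(tail, 4), bound)
```

Here μ = 0 and σ = .2, so the event |X − μ| ≥ 5σ is the same as |X| ≥ 1, which has probability .04 = 1/25.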


M(t) = .5e^t/(1 − .5e^t), E(X) = 2, V(X) = 2


a. .01e^(9t) + .05e^(10t) + .16e^(11t) + .78e^(12t)
b. 11.71, .3659
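A moment generating function of the form Σ p(x)e^(tx) lets the pmf be read off from its coefficients and exponents. Assuming the mgf in part (a) has exponents 9, 10, 11, 12 with coefficients .01, .05, .16, .78, the mean and variance in part (b) follow directly, as this Python sketch shows (variable names are our own).

```python
# pmf read off from the mgf: exponent multiplier -> coefficient
pmf = {9: 0.01, 10: 0.05, 11: 0.16, 12: 0.78}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x * x * p for x, p in pmf.items())
variance = second_moment - mean ** 2   # shortcut: E(X^2) - [E(X)]^2

print(round(mean, 2), round(variance, 4))
```

The same numbers could be obtained by differentiating M(t) twice and evaluating at t = 0; reading off the pmf is simply the quicker route for a finite sum.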


p(0) = .2, p(1) = .3, p(3) = .5, E(X) = 1.8, 
V(X) = 1.56 


a. 5, 4 b. 5, 4


p(y) = (.25)^(y−1)(.75) for y = 1, 2, 3, ....
M(t) = e^(t²/2), E(X) = 0, V(X) = 1


a..124 b..279 c..635 d. .718 


873 
007 
716 
277 
_ 1.25, 1.09 


9 Bes B 


83. 
85. 


89. 
91. 
93. 
95. 
97. 
99. 
101. 
103. 
109. 
111. 
113. 


. a. .786 b. .169 
. a. 403 b. .787 


. a..017 b. 811, .425 


c. .382 


c. .774 


. .1478 


. .407, assuming batteries’ voltage levels are 


independent 


. a. 0104 c. .00197 d. 1500, 260 


c. .006, 902, .586 


. For p = .9 the probability is higher for 


B (.9963 versus .99 for A) 
For p = .5 the probability is higher for A (.75 
versus .6875 for B) 


. a. 20, 16 (binomial, n = 100, p = .2) 


b. 70, 21 


a.p=Oorl b.p=.5 

When p = .5, the true probability for k = 2 is 
.0414, compared to the bound of .25. 
When p = .5, the true probability for k = 3 is 
.0026, compared to the bound of .1111. 
When p = .75, the true probability for k = 2 
is .0652, compared to the bound of .25. 
When p = .75, the true probability for k = 3 
is .0039, compared to the bound of .1111. 


a. .932 b..065 c..068 d..491 e. .251 


a..011 b..441 cc. .554,.459 d. .944 


a..219 b. .558 

857 

a. .122, .808, .283  b. 12,3.464 c. .530,.011 
a. .099 b..135 c.2 
a.4 b..215 c. 1.15 years 


a. .221 b. 6,800,000 c. p(x; 1608.5) 


a..114 b. .879 c..121 d. use Bin(15, .1) 


a. h(x; 15, 10,20) b. 0325 c. .6966 


a. h(x; 10, 10, 20) b. .0325 c. h(x; n, n, 2n), 
E(X) = n/2, V(X) = n7/[4(2n-1)] 


115. 


117. 
119. 
121. 
125. 


127. 


129. 


131. 


133. 
135. 


137. 
139. 
141. 


143. 
145. 
147. 
149. 


151. 


153. 


155. 
157. 
. nb(x; 2, .5) = (x + 1)(.5)^(x+2), x = 0, 1, 2, 3, ...


a 

b. 3/16 

c. 11/16 

d. 4,2 

24+24+2=6 

a..2817 b.7513 c..4912, .9123 


a. 160, 21.9 b. .6756 


mean ≈ 0.5968, sd ≈ 0.8548 (answers will
vary)


≈ .9090 (answers will vary)


a. mean ≈ 13.5888, sd ≈ 2.9381
b. ≈ .1562 (answers will vary)


mean ≈ 3.4152, variance ≈ 5.97
(answers will vary)


b. 142 tickets


a. ≈ .2291 b. ≈ $8696 c. ≈ $7811
.2342, $7767, $7571 (answers will vary) 


a. 
d.x 
b. ≈ .9196 (answers will vary)
b. 3.114, .405, .636 


a. b(x; 15, .75)
b. .6865 c. .313
d. 45/4, 45/16 e. .309


a. .013 b. 19 c. .266 d. Poisson(500) 


a. p(x; 2.5) b. .067 c. .109
1.813, 3.05
p(2) = p², p(3) = (1 − p)p², p(4) = (1 − p)p²,
p(x) = [1 − p(2) − ··· − p(x − 3)](1 − p)p², x = 5, 6, 7, ....
Alternatively, p(x) = (1 − p)p(x − 1) +
p(1 − p)p(x − 2), x = 5, 6, 7, ...; .99950841


a. .0029 b. .0767, .9702


cS [ples 2) 


x=0 


a. .135 b. .00144 


3.590 
a. No b. .0273
159. b. .5μ_1 + .5μ_2
c. .5μ_1 + .5μ_2 + .25(μ_1 − μ_2)²
d. p(x; μ_1, μ_2) = .6 p(x; μ_1) + .4 p(x; μ_2)
161. .5 
165. X ~ b(x; 25, p), E(h(X)) = 500p + 750,
σ_{h(X)} = 100√(p(1 − p)). Independence and
constant probability might not be valid 
because of the effect that customers can 
have on each other. Also, store employees 
might affect customer decisions. 
167. p(0) = .07776, p(1) = .10368, p(2) = .19008, 
p(3) = .20736, p(4) = .17280, p(5) = .13824, 
p(6) = .06912, p(7) = .03072, p(8) = .01024 
Chapter 4 
1. a. .25 b. .5 c. 7/16
3. b. .5 c. 11/16 d. .6328
5. a. 3/8 b. 1/8 c. .2969 d. .5781
7. a. f(x) = .10 for 25 < x < 35 and = 0
otherwise
b. .2 c. .4 d. .2
9. a. .699 b. .301, .301 c. .166
11. a. 1/4 b. 3/16 c. 15/16 d. √2 e. f(x) = x/2
for 0 < x < 2, and f(x) = 0 otherwise
13. a. 3 b. 0 for x < 1, 1 − 1/x³ for x > 1


23; 
27. 


c. 1/8, .088 


p> 


F(x) = 0 for x < 0, F(x) = x³/8 for
0 < x < 2, F(x) = 1 for x > 2
b. 1/64 c. .0137, .0137 d. 1.817


. 90th percentile of Y = 1.8(90th percentile 
of X) + 32 c. 100pth percentile of Y = 
a(100pth percentile of X) + b for a > 0. 


io” 


a. 35, 25 b. .865 


_ a. 8182, 1113 b. .314 
a. A + (B − A)p


b. (A + B)/2
c. (B^{n+1} − A^{n+1})/[(n + 1)(B − A)]


314.79 
248, 3.6 


29. 
31. 
33. 
35. 


39. 


41. 
43. 


45. 


47. 
49. 


51. 
53. 
55. 
57. 


59. 


61. 
63. 
65. 
69. 


71. 
73. 
75. 




1/4, 1/16 


a. v/20, v/800 b. 100.27 versus 1007 c. 8017 


M_Y(t) = (e^{5t} − e^{−5t})/(10t), Y ~ Unif[−5, 5]


a. 


b. 


M_X(t) = .04e^{10t}/(.04 − t) for t < .04,
mean = 35, variance = 625

M(t) = .04/(.04 − t) for t < .04, mean =
25, variance = 625


M_Y(t) = .04/(.04 − t); Y is a shifted


exponential rv 


a. .4850 b. .3413 c. .4938 d. .9876 e. .9147
f. .9599 g. .9104 h. .0791 i. .0668 j. .9876


a. 1.34 b. −1.34 c. .674 d. −.674 e. −1.555


a. .9772 b. .5 c. .9104 d. .8413 e. .2417
f. .6826


a. .1977 b. .0004 c. The top 5% are the
values above .3987.


The second machine 


a. 


.2514, ~0 b. 39.985 ksi 


0510 


a. .8664 b. .0124 c. .2718
a. .794 b. 5.88 c. 7.94 d. .265


a. Φ(1.72) − Φ(.55) b. Φ(.55) − [1 − Φ(1.72)];


aoe 


No, due to symmetry. 


a. .4584 b. 135.8 kph c. .9265 d. .3173
e. .6844


a. .7286 b. .8643, .8159
a. .9932 b. .9875 c. .8064


a. .0392 b. ≈ 1


a. .15872 ≈ actual .15866


b. .0013495 ≈ actual .0013499
c. .999936655 ≈ actual .999936658
d. .00000028669 ≈ actual .00000028665


a. 120 b. 1.329 c. .371 d. .735 e. 0
.5,4b..715 c. 411 
a. 1 b. 1 c. .982 d. .129
77.


79. 


81. 
85. 


89. 
91. 


93. 


95. 
97. 


99. 


101. 
103. 
105. 


107. 


109. 


111. 


113. 


115. 
117. 


a. .449, .699, .148 b. .050, .018 


a. ∩A_i b. Exponential with λ = .05
c. Exponential with parameter nλ


a. Gamma, α = 3, β = 1/λ b. .8165


a. .275, .599, .126 b. 20.418, 17.365
c. 15.84 


2 


a. .9295 b. .2974 c. 98.184


a. 7.53, 9.966 b. .7823, .1469 
c. .6925; lognormal is not symmetric 


a. 149.157, 223.595 b. .957 
d. 148.41 e.9.57 f. 125.90 


a= 
b. Γ(α + β)Γ(m + β)/[Γ(α + β + m)Γ(β)],
β/(α + β)


Yes, since the pattern in the plot is quite 
linear. 


c. .0416 


Yes 
Yes, because the plot is reasonably straight 


Form a new variable, the logarithms of a 
TN value, and then construct a normal plot 
for its values. Because of the linearity of 
this plot, normality is plausible. 


The pattern in the normal probability plot
is curved downward, consistent with a
right-skewed distribution. It is not plausible
that shower flow rate has a normal
population distribution.


The plot deviates from linearity, especially
at the low end, where the smallest three
observations are too small relative to the
others. The plot works for any λ because λ
is a scale parameter.


fy) = 2, y > 1 

fry) = ye", y > 0 
f_Y(y) = 1/16, 0 < y < 16
f_Y(y) = 1/[π(1 + y²)]


119. 


121. 


125. 


129. 


131. 
133. 


135. 
137. 
139. 


141. 
143. 
145. 


147. 


149. 
151. 
153. 


157. 


Answers to Odd-Numbered Exercises 


Y = X7/16 
fry) = Feet? for y > 0 


a. F(x) = x²/4, x = 2√u
c. sample mean and sd = 1.331 and 0.471
(answers will vary), μ = 4/3 and
σ = √2/3


b. sample mean = 15.9188, close to 16 
(answers will vary) 


$1480 


b. F(x) = 1 − 16/(x + 4)², x > 0; F(x) = 0,
x < 0 c. .247 d. 4 e. 16.67


.6563 b. 41.55 c. .3197 


~ 


.0003 b. .0888 


p 


a. F(x) = 1.5(1 − 1/x), 1 < x < 3; F(x) = 0,
x < 1; F(x) = 1, x > 3
b. .9, .4 c. 1.6479 d. .5333 e. .2662


fo) 


2 


. 1.075, 1.075 b. .0614, .3331 c. 2.476 


io” 


. $95,600, .3300 


b. F(x) = .5e^{.2x}, x < 0; F(x) = 1 − .5e^{−.2x},
x > 0 c. .5, .6648, .2555, .6703


a. k = (α − 1)5^{α−1} for α > 1 b. F(x) = 0,
x < 5; F(x) = 1 − (5/x)^{α−1}, x > 5
c. 5(α − 1)/(α − 2)


~ 


o 


b. .4602, .3636 c. .5950 d. 140.178
a. Weibull b. .542 


LA beta iB 

F(x) = 1 − e^{−α(x − x²/(2β))}, 0 < x < β;
F(x) = 0, x < 0; F(x) = 1 − e^{−αβ/2},
x > β; f(x) = α(1 − x/β)e^{−α(x − x²/(2β))},
0 < x < β; f(x) = 0, x < 0; f(x) = 0, x > β
This gives total probability less than 1, so 
some probability is located at infinity (for 
items that last forever). 


fo) 


QO 


F(q*) = .818
Chapter 5
1. a. .20 b. .42 c. The probability of at least
one hose being in use at each pump is .70.
d. x        0     1     2
   p_X(x)   .16   .34   .50
   y        0     1     2
   p_Y(y)   .24   .38   .38
   P(X ≤ 1) = .50


13. 


15. 


17. 


19. 


21. 


e. dependent, .30 = P(X = 2 and Y = 2) ≠
P(X = 2) P(Y = 2) = (.50)(.38)


a. .15 b. .40 c. .22 = P(A) = P(|X_1 − X_2| ≥ 2)
d. .17, .46
a. .0305 b. .1829


c. probability = .1073, marginal evidence 
a. .054 b. .00018 


a. .030 b. .120 c. .10, .30 d. .38 e. yes,
p(x, y) = p_X(x) · p_Y(y)


a. 3/380,000 b. .3024 c. .3593


d. 10kx² + .05, 20 ≤ x ≤ 30 e. no


a. p(x, y) = e^{−μ_1−μ_2} μ_1^x μ_2^y/(x! y!)
b. e^{−μ_1−μ_2}[1 + μ_1 + μ_2]
c. e^{−μ_1−μ_2}(μ_1 + μ_2)^m/m!, Poisson with parameter
μ_1 + μ_2


a. e^{−x−y}, x > 0, y > 0 b. .3996 c. .5940
d. .3298


a. F(y) = 1 − 2e^{−2λy} + e^{−3λy} for y > 0,
F(y) = 0 for y < 0; f(y) =
4λe^{−2λy} − 3λe^{−3λy} for y > 0

b. 2/(3λ)


a. .25 b. 1/π c. 2/π d. f_X(x) =
2√(r² − x²)/(πr²) for −r < x < r, f_Y(y) =
2√(r² − y²)/(πr²) for −r < y < r, no


1/3 


I/n oc. 


25. 
27. 


29. 
31. 
33. 
35. 


Bs 


41. 
43. 
45. 
47. 
49. 


ol. 


53. 
39. 
57. 
61. 


65. 
67. 
69. 


71. 


y      0     1     2
p(y)   .77   .14   .09


c. no d. 0.35, 0.32 e. 95.72 


15 
L 


/4h 

—2/3 

—.1082, -.0131 
238, 51 


V(h(X, Y)) = E(h²(X, Y)) − [E(h(X, Y))]²,
13.34 


ρ = 1 when a > 0

a. 87,850, 4370.37 b. yes, no c. .0027 
.0336, .2310

.0314


a. 45 min b. 68.33 c. −1, 13.67 d. −5, 68.33


a. 50, 10.308 b. .0075 c. 50 d. 111.5625 
e. 131.25 


a. .9616  b. .0623 
a. .5, n(n + 1)/4 b. .25, n(n + 1)(2n + 1)/24
10:52.76 


a. Bin(10, 18/38) b. Bin(15, 18/38) 
c. Bin(25, 18/38) f. no 


c. Gamma(n, 1/λ)
a. 2 c. 0, 2n, 1/(1 − t²)^n d. 1/(1 − t²/2n)^n


a. f_X(x) = 2x, 0 < x < 1

b. f_{Y|X}(y|x) = 1/x, 0 < y < x < 1

c. .6

d. no, the domain is not a rectangle

e. E(Y|X = x) = x/2 f. V(Y|X = x) = x²/12


a. f_X(x) = 2e^{−2x}, 0 < x < ∞
b. f_{Y|X}(y|x) = e^{−(y−x)}, 0 < x < y < ∞




73. 


75. 


77. 


79. 


81. 


83. 


85. 
87. 
89. 


91. 


93. 


d. no, the domain is not rectangular 
e. E(Y|X = x) = x + 1 f. V(Y|X = x) = 1


a. x/2, x²/12 b. 1/x, 0 < y < x < 1 c. −ln(y),
0 < y < 1 d. 1/4, 7/144 e. 1/4, 7/144


a. p_{Y|X}(0|1) = 4/17, p_{Y|X}(1|1) = 10/17,
p_{Y|X}(2|1) = 3/17

b. p_{Y|X}(0|2) = .12, p_{Y|X}(1|2) = .28, p_{Y|X}(2|2)
= .60 c. .40

d. p_{X|Y}(0|2) = 1/19, p_{X|Y}(1|2) = 3/19,
p_{X|Y}(2|2) = 15/19


IBN? ble 0 yer ST 
.O//y)-1,0<y<1 


a. p(1,1) = p(2,2) = p(3,3) = 1/9, p(2,1) =
p(3,1) = p(3,2) = 2/9

b. p_X(1) = 1/9, p_X(2) = 3/9, p_X(3) = 5/9

c. p_{Y|X}(1|1) = 1, p_{Y|X}(1|2) = 2/3, p_{Y|X}(2|2) =
1/3, p_{Y|X}(1|3) = .4, p_{Y|X}(2|3) = .4,
p_{Y|X}(3|3) = .2

d. E(Y|X=1) = 1, E(Y|X=2) = 4/3, E(Y|X=3)
= 1.8, no

e. V(Y|X=1) = 0, V(Y|X=2) = 2/9, V(Y|X=3)
= .56

a. p_{X|Y}(1|1) = .2, p_{X|Y}(2|1) = .4, p_{X|Y}(3|1) =
.4, p_{X|Y}(2|2) = 1/3, p_{X|Y}(3|2) = 2/3,
p_{X|Y}(3|3) = 1

b. E(X|Y=1) = 2.2, E(X|Y=2) = 8/3,
E(X|Y=3) = 3

c. V(X|Y=1) = .56, V(X|Y=2) = 2/9,
V(X|Y=3) = 0

a. p_X(x) = .1, x = 0, 1, 2, ..., 9; p_{Y|X}(y|x) =
1/9, y = 0, 1, 2, ..., 9, y ≠ x;
p_{X,Y}(x, y) = 1/90, x, y = 0, 1, 2, ..., 9, y ≠ x
b. E(Y|X=x) = 5 − x/9, x = 0, 1, 2, ..., 9


a. .6x, .24x b. 60 c. 60 
176, 12.68 


a. 1 + 4p, 4p(1 − p) b. 2598, 16,518,196

c. 2598(1 + 4p), 16518196 + 93071200p − 26998416p²

d. 2598 and 4064, 7794 and 7504, 12,990 
and 9088 


a. normal, mean = 984, variance = 38,988 
b. .1379 c. 1237 


a. N(158, 8.72) b. N(170, 8.72) c. .4090 




95. a. .8875x + 5.2125 b. 111.5775 
c. 10.563 d. .0951 


97. a. 2x − 10 b. 9


99. a. .1410  b. .1165 
With positive correlation, the deviations 
from their means of X and Y are likely to 
have the same sign. 


c. 3 d. .0228


101. a pe Os b.teeM «Yes 
103. a. fY)=y2-y),0<y<l 
b. f_W(w) = 2(1 − w), 0 < w < 1


105. 4y_3[ln(y_3)]² for 0 < y_3 < 1


109. a. g_5(y) = 5y⁴/10⁵, 25/3 b. 20/3 c. 5
d. 1.409

111. g_{Y5|Y1}(y_5 | 4) = [2/3][(y_5 − 4)/6]³,
4 < y_5 < 10; 8.8

113. 1/(n + 1), 2/(n + 1), 3/(n + 1), ..., n/(n + 1)

115. Γ(n + 1)Γ(i + 1/θ)/[Γ(i)Γ(n + 1 + 1/θ)],
Γ(n + 1)Γ(i + 2/θ)/[Γ(i)Γ(n + 1 + 2/θ)] − [Γ(n + 1)Γ(i + 1/θ)/(Γ(i)Γ(n + 1 + 1/θ))]²

117. a. .0238 b. $2025 


121. a. n(n − 1)[F(y_n) − F(y_1)]^{n−2} f(y_1) f(y_n)
for y_1 < y_n
b. f_W(w) = ∫ n(n − 1)[F(w + w_1) − F(w_1)]^{n−2} f(w_1) f(w + w_1) dw_1
c. n(n − 1)w^{n−2}(1 − w) for 0 < w < 1
123. f_T(t) = e^{−t/2} − e^{−t} for t > 0

125. a. 3/81,250
b. f_X(x) = ∫_{20−x}^{30−x} kxy dy = k(250x − 10x²) for 0 < x < 20;
f_X(x) = ∫_0^{30−x} kxy dy = k(450x − 30x² + ½x³) for 20 < x < 30;
f_Y(y) = f_X(y); dependent
c. .3548 d. 25.969 e. −32.19, −.894
f. 7.651


127. 7/6


131. c. If p(0) = .3, p(1) = .5, p(2) = .2, then 1
is the smaller of the two roots, so
extinction is certain in this case with
μ < 1. If p(0) = .2, p(1) = .5, p(2) = .3,
133.


135. 


137. 


141. 


143. 


145. 


147. 


. F(x, y) = Oxy + Axy’, 0 


then 2/3 is the smaller of the two roots, 
so extinction is not certain with uw > 1. 


P((X, Y) ∈ A) = F(b, d) − F(b, c) −
F(a, d) + F(a, c)


P((X, Y) ∈ A) = F(10, 6) − F(10, 1) −
F(4, 6) + F(4, 1)
P((X, Y) ∈ A) = F(b, d) − F(b, c − 1) −
F(a − 1, d) + F(a − 1, c − 1)


At each (x*, y*), F(x*, y*) is the sum of
the probabilities at points (x, y) such
that x ≤ x* and y ≤ y*

                 x
F(x, y)      100    250
      200    .50    1
y     100    .30    .50
      0      .20    .25


F(x, y) = .6x²y + .4xy³, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1;
F(x, y) = 0, x ≤ 0; F(x, y) = 0, y ≤ 0;
F(x, y) = .6x² + .4x, 0 ≤ x ≤ 1, y > 1;
F(x, y) = .6y + .4y³, x > 1, 0 ≤ y ≤ 1;
F(x, y) = 1, x > 1, y > 1

P(.25 ≤ X ≤ .75, .25 ≤ Y ≤ .75) = .23125


F(x, y) = 6x²y², x + y ≤ 1, 0 ≤ x ≤ 1,
0 ≤ y ≤ 1, x ≥ 0, y ≥ 0
F(x, y) = 3x⁴ − 8x³ + 6x² + 3y⁴ − 8y³ +
6y² − 1, x + y > 1, x ≤ 1, y ≤ 1
F(x, y) = 0, x ≤ 0; F(x, y) = 0, y ≤ 0;
F(x, y) = 3x⁴ − 8x³ + 6x², 0 ≤ x ≤ 1,
y > 1
F(x, y) = 3y⁴ − 8y³ + 6y², 0 ≤ y ≤ 1,
x > 1
F(x, y) = 1, x > 1, y > 1


a. 2x, x b. 40 c. 100 


2/[(1 − 1000t)(2 − 1000t)], 1500 hours


a. 2360, 73.7021 b. .9713 
8340 


Chapter 6 


11. 


13. 


a. 


s²      0     112.5   312.5   800
p(s²)   .38   .20     .30     .12
E(S²) = 212.25 = σ²

x/n      0        .1       .2       .3       .4
p(x/n)   0.0000   0.0000   0.0001   0.0008   0.0055
x/n      .5       .6       .7       .8       .9       1.0
p(x/n)   0.0264   0.0881   0.2013   0.3020   0.2684   0.1074

x̄       1     1.5   2     2.5   3     3.5   4
p(x̄)    .16   .24   .25   .20   .10   .04   .01

P(X̄ ≤ 2.5) = .85

r      0     1     2     3
p(r)   .30   .40   .22   .08

24 
x̄ p(x̄) x̄ p(x̄) x̄ p(x̄)
0.0 0.000045 1.4 0.090079 2.8 0.052077 
0.2 0.000454 1.6 0.112599 3.0 0.034718 
0.4 0.002270 1.8 0.125110 3.2 0.021699 
0.6 0.007567 2.0 0.125110 3.4 0.012764 
0.8 0.018917 2.2 0.113736 3.6 0.007091 
1.0 0.037833 2.4 0.094780 3.8 0.003732 
1.2 0.063055 2.6 0.072908 4.0 0.001866 
a. 12, .01
b. 12, .005
c. With less variability, the second sample
is more closely concentrated near 12.


No, the distribution is clearly not sym- 
metric. A positively skewed distribution— 
perhaps Weibull, lognormal, or gamma. 


a. .0746
b. .00000092. No, 82 is not a reasonable
value for μ.




15. 
17. 
19. 
21. 


27. 


29. 


35. 
39. 
41.


49. 


a. .8366 b. no 
43.29 
a. .9772, .4772  b. 10 


a. .9838 b. .8926 c. .9862 and .8934, both 
quite close 


1/X 


Because χ²_ν is the sum of ν independent
random variables, each distributed as χ²_1, the
Central Limit Theorem applies.


a. 3.2 b. 10.04, the square of (a) 
a. 4.32 


a. ν_2/(ν_2 − 2), ν_2 > 2
b. 2ν_2²(ν_1 + ν_2 − 2)/[ν_1(ν_2 − 2)²(ν_2 − 4)],
ν_2 > 4


a. The approximate value, .0228, is smaller 
because of skewness in the chi-squared 
distribution 

b. This approximation gives the answer 
.03237, agreeing with the software 
answer to this number of decimals. 


53. a. .9686 b..90 c. .87174 
55. .048 
57. a. .9544 for all nb. .8839, .9234, .9347; 
increases with n toward (a) 
59. a. 2.6, 1.2 b. 390, 14.7 c. ≈ 1
61. .9686
63. .0722
65. a. .5774, .8165, .9045
b. 1.312, 4.303, 18.216 
67. a. .049 b. .09 
Chapter 7 
1. a. 113.73, X̄ b. 113, X̃


c. 12.74, S, an estimator for the population 
standard deviation 


13. 


17. 
19. 
21. 


25. 
27. 


29. 


31. 


33. 


a. 1.3481, X̄ b. .0846
d. The sample proportion of students
exceeding 100 in IQ is 30/33 = .91
e. .112, S/X̄


c. 1.3481, X̄
d. 1.78, X̄ + 1.282S e. .6736


θ̂_1 = N X̄, θ̂_2 = T − N D̄, θ̂_3 = T · X̄/Ȳ;


1,703,000, 1,591,300, 1,601,438.281


a. 120.6 b. 1,206,000, 10,000X̄ c. .8


d. 120, X̃


a. X̄, 2.113 b. √(μ/n), .119
11. 


io” 


b. √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)

c. In part (b) replace p_1 with X_1/n_1 and
replace p_2 with X_2/n_2

d. —.245 e. .0411 


c. [m2 /[(n — 1? (n — 2)? 


a. θ̂ = ΣX_i²/(2n)
4/9 


b. 74.505 


a. p̂ = 2λ̂ − .30 = .20

b. p̂ = (100λ̂ − 9)/70

a. .15 b. yes c. .4437

a. θ̂ = (2x̄ − 1)/(1 − x̄) = 3

b. θ̂ = [−n/Σln(x_i)] − 1 = 3.12

p̂ = r/x = .15. This is the number of successes
over the number of trials, the same as
the result in Exercise 25. It is not the same
as the estimate of Exercise 19.


a. (210) —"/7e-24i/20 

n n tx 
ee Pe ee: 
di aif > ae 


a. Σx_i²/(2n) = 74.505, the same as
Exercise 17


b. √(2θ̂ ln(2)) = 10.16
35.


37. 
39. 


41. 
43. 
45. 
47. 


49. 


51. 


53. 


55. 
57. 
59. 


61. 


63. 
67. 
69. 
73. 
75. 


77. 


λ̂ = −ln(p̂)/24 = .0120
a.X b.X 


No, statistician A does not have more 
information. 


[Dini 8) Doe 
Sox; 

(min{X_i}, max{X_i})
2X(n − X)/[n(n − 1)]


X̄, by the Rao-Blackwell Theorem, because
X̄ is sufficient for μ


a. 1/[p²(1 − p)]
b. n/[p²(1 − p)]
c. p²(1 − p)/n


a. If we ignore the boundary, 1/θ² b. θ²/n
c. Both are less than θ²/n; Cramér-Rao does
not apply because the boundaries of the
uniform variable X include θ itself


a.x b.N(u,0//n) 
a. 2/o_b. Yes 


c. Yes d. They agree 


a. 1/p, (1 − p)/(np²) b. (1 − p)/(np²) c. Yes


λ̂ = 6/(6t_6 − t_1 − ··· − t_5) =
6/(x_1 + 2x_2 + ··· + 6x_6) = .0436, where
x_1 = t_1, x_2 = t_2 − t_1, ..., x_6 = t_6 − t_5
2.912, 2.242 

5.93, 11.66 

b. no, E(σ̂²) = σ²/2, so 2σ̂² is unbiased


448, .4364 


d(X) = (−1)^X, d(200) = 1, d(199) = −1


β̂ = Σx_i y_i/Σx_i² = 30.040, the estimated
minutes per item; σ̂² = (1/n)Σ(y_i − β̂x_i)² =
16.912; 25β̂ = 751
Chapter 8
1. a. 99.5% b. 85% c. 2.97 d. 1.15
3. a. A narrower interval has a lower probability
b. No, μ is not random
c. No, the interval refers to μ, not individual
observations
d. No, a probability of .95 does not guarantee
95 successes in 100 trials
5. a. (4.52, 5.18) b. (4.12, 5.00) c. 55 d. 94
7. Increase n by a factor of 4. Decrease the
width by a factor of 5.
9. a. x̄ − z_α σ/√n b. 4.418 c. 59.70
11. 950; .8724 (normal approximation), 
.8731 (binomial) 
13. a. 1.341 b. 1.753 c. 1.708 d. 1.684 e. 2.704 
15. a. 2.228 b. 2.131 c. 2.947 d. 4.604 e. 2.492 
f. 2.715 
17. a. Yes b. (4.89, 5.79) c. (.5868, .6948) fl oz 
19. a. (63.12, 66.66) b. No, data indicates the 
population is not normal 
21. a. (29.26, 40.78) b. (3.61, 73.65); times are 
normally distributed; no 
23. a. (18.94, 24.86) b. narrower c. narrower 
d. (12.09, 31.71) 
25. a. Assuming normality, a 95% lower confidence
bound is 8.11. When the bound is
calculated from repeated independent
samples, roughly 95% of such bounds
should be below the population mean.
b. A 95% lower prediction bound is 7.03.
When the bound is calculated from
repeated independent samples, roughly
95% of such bounds should be below the
value of an independent observation.


27. a. 378.85 b. 413.14 c. (340.16, 401.22)


29. (−8228.0, 116,042.4); yes (note that negative
values in PI make its validity suspect)




33. 


35. 
37. 
39. 
41. 
43. 
45. 
47. 
49. 
51. 
53. 
55. 
57. 


59. 
61. 
63. 


a. (169.36, 179.37)

b. (134.30, 214.43), which includes 152

c. The second interval is much wider,
because it allows for the variability of a
single observation.

d. The normal probability plot gives no
reason to doubt normality. This is especially
important for part (b), but the large
sample size implies that normality is not
so critical for (a).


a. (18.413, 20.102) b. 18.852 c. data indicates
the population distribution is not
normal


(18.3, 19.7) days 

0.056; yes, though potentially not by much 
97 

a. 80% b. 98% c. 75% 

(.798, .845)

a. .504 b. Yes, since p > .504 > .5
584

(.513, .615)

441; yes 

a. 381 b. 339 

(.028, .167) 


a. 22.307 b. 34.381 c. 44.313 d. 49.925 
e. 11.523 f. 10.519 


(0.4318, 1.4866), (.657, 1.219) 
b. (2.34, 5.60) c. False 


a. (7.91, 12.00) b. Yes 
c. The accompanying R code assumes the 
data has been read in as the vector x. 


N = 5000
xbar = rep(0, N)   # allocating space for bootstrap means
for (i in 1:N) {
  resample = sample(x, length(x), replace = T)
  xbar[i] = mean(resample)
}


65. 


67. 


69. 


71. 
133 


Td: 
77. 
d. (7.905, 12.005) for one simulation;
bootstrap distribution is somewhat 
skewed, so validity is questionable 

e. (8.204, 12.091) for one simulation 

f. Bootstrap percentile interval; population 
and bootstrap distributions are both 
skewed 


a. (26.61, 32.94) 

b. Because of outliers, weight gain does not 
seem normally distributed. However, 
with n = 68, the effects of the CLT might 
be enough to validate use of t procedures 
anyway. 

d. (26.66, 32.90) for one simulation; yes, 
because histogram is bell-shaped 

e. (26.69, 32.88) for one simulation 

f. All three are close, so one-sample 
t should be considered valid 


a. (38.46, 38.84) 

b. Although a normal probability plot is not 
perfectly straight, there is not enough 
deviation to reject normality. 

d. (38.47, 38.83) for one simulation; possibly
invalid because bootstrap distribution
is somewhat skewed

e. (38.51, 38.81) for one simulation 

f. All three intervals are surprisingly 
similar. 

g. Yes: all CIs are well above normal body 
temperature of 37°C 


a. (170.75, 183.57) 

b. Plot is reasonably linear, CI in part (a) is 
legitimate. 

d. (170.76, 183.46) for one simulation; very 
similar to (a) 

e. (171.04, 183.00) for one simulation 

f. All three are valid, so use the shortest:
(171.04, 183.00). None of the CIs capture
the true μ value
a. .614 b. 4727.8, no
c. Yes: .66(7700) = 5082 > 5000 


a. (.163, .174) 
(.1295, .2986) 


b. (.089, .326)
79.


81. 


85. 


87. 


89. 


91. 


a. At 95% confidence, average TV viewing
time for the population of all 0–11
months old children is between 0.8 and
1.0 hours per day. (Similar interpretation
for others.)

b. Samples with larger standard deviations 
and/or smaller sample sizes will result in 
wider intervals. 

c. Yes: none of the intervals overlap 


c. σ²/Σx_i², σ/√(Σx_i²) d. Spread out:
variance is inversely proportional to sum
of squares of x values


on pb ae to25,n—15// pee (29.93, 30.15) 


a. χ²_{.95,2n}/(2ΣX_i), .0098
b. exp(−t · χ²_{.95,2n}/(2ΣX_i)) = .058


a. (x̄ − t_{.025,n−1,δ} · s/√n, x̄ − t_{.975,n−1,δ} · s/√n)
b. (3.01, 4.46) 


a. 1/2^n b. n/2^n c. (n + 1)/2^n, 1 − (n + 1)/2^{n−1},
(29.9, 39.3) with confidence level .9785


a. P(A_1 ∩ A_2) = .95² = .9025

b. P(A_1 ∩ A_2) ≥ .90

c. P(A_1 ∩ A_2) ≥ 1 − 2α;
P(A_1 ∩ A_2 ∩ ··· ∩ A_k)
≥ 1 − kα


Chapter 9 


1. 


5. 


9. 


a. yes b. no c. no d. yes e. no f. yes 


H0: σ = .05 versus Ha: σ < .05. Type I error:
Conclude that the standard deviation is less
than .05 mm when it is really equal to .05
mm. Type II error: Conclude that the standard
deviation is .05 mm when it is really less
than .05.


. A type I error here involves saying that the 


plant is not in compliance when in fact it is. 
A type II error occurs when we conclude that 
the plant is in compliance when in fact it 
isn’t. A government regulator might regard 
the type II error as being more serious. 


a. R, 
b. Reject Ho 


11. 


13. 


15. 
17. 


19. 


21. 


23% 


27. 
c. A type I error involves saying that the
two companies are not equally favored
when they are. A type II error involves
saying that the two companies are
equally favored when they are not.

d. Bin(25, .5); .0433

e. β(.3) = β(.7) = .488;

β(.4) = β(.6) = .845;
power = .512 for .3 and .7, .155
for .4 and .6


a. H0: μ = 10 versus Ha: μ ≠ 10
b. .01

c. .5319, .0078

d. c = 2.58

e. c = 1.96

f. x̄ = 10.02, so do not reject H0


c. .0004, ≈ 0, P(type I error) < α = .01
when μ < μ0


a. .0301 b. .003 c. .004 


Test H0: μ = .5 versus Ha: μ ≠ .5


a. Do not reject H0 because t_{.025,12} = 2.179
> |1.6|

b. Do not reject H0 because t_{.025,12} = 2.179
> |−1.6|

c. Do not reject H0 because t_{.005,24} = 2.797
> |−2.6|

d. Reject H0 because t_{.005,24} = 2.797 <
|−3.9|


a. Do not reject H0 because |−2.27| < 2.576
b. .2266
c. 22


z = 2.14, so reject H0 at .05 level but not at
.01 level


Because t = 2.24 > 1.708 = t_{.05,25}, reject
H0: μ = 360. Yes, this suggests contradiction
of prior belief.


a. Because |−1.40| < 2.064, H0 is not
rejected at the .05 level.
b. 600 lies in the CI for μ


a. no, t = −.02 b. .58 c. n = 20 total
observations
29.


31. 


33. 


37. 


39. 
41.


43. 


45. 


47. 


Since 1.04 < 2.132, we do not reject H0 at
the .05 significance level.


a. Because t = .50 < 1.89 = t_{.05,7}, do not
reject H0.

b. .73

Because t = −1.24 > −1.40 = −t_{.10,8}, we do
not have evidence to question the prior
belief.


Ll! = Lo 
© LF tenij3n-1, 
: (. a Te) 


= lo 
b. Fl -tyo,-13n— 1 
( a/2n—135 i) 


LH! = Lo 
1—F ba/2,n—-13 =; 
" (12 ae) 


Since |−2.469| > 1.96, reject H0.


a. Do not reject H0: p = .10 in favor of
Ha: p > .10 because 16 or more blistered
plates would be required for rejection at the
.05 level. Because H0 is not rejected, there
could be a type II error.

b. β(.15) = .4920 when n = 100;

β(.15) = .2743 when n = 200

c. 362


a. Do not reject H0: p = .02 in favor of
Ha: p < .02 because z = −1.01 is not in
the rejection region at the .05 level.
There is no strong evidence suggesting
that the inventory be postponed.

b. β(.01) = .195

c. 1 − β(.05) ≈ 0


a. Test H0: p = .05 versus Ha: p ≠ .05. Since
z = 3.07 > 2.58, H0 is rejected. The
company’s premise is not correct.

b. β(.10) = .033


Using n = 25, the probability of 5 or more
leaky faucets is .0980 if p = .10, and the
probability of 4 or fewer leaky faucets is
.0905 if p = .3. Thus, the rejection region is
5 or more, α = .0980, and β = .0905.

a. reject


b. reject c. do not reject


51. 


53. 


SD: 


OF: 


59. 


61. 


63. 
65. 
67. 
69. 


71. 
d. reject e. do not reject


a. .0778 b. .1841 c. .0250 d. .0066
e. .5438


a. P = .0403 b. P = .0176 c. P = .1304
d. P = .6532 e. P = .0021 f. P = .00022


Based on the given data, there is no reason to
believe that pregnant women differ from others
in terms of serum receptor concentration.


a. Because the P-value is .166, no modification
is indicated.
b. .9974


Because t = −1.759 and the P-value = .082,
which is less than .10, reject H0: μ = 3.0
against a two-tailed alternative at the 10%
level. However, the P-value exceeds .05, so
do not reject H0 at the 5% level. There is
just a weak indication that the percentage is
not equal to 3% (lower than 3%).


a. Test H0: μ = 10 versus Ha: μ < 10

b. Because the P-value is .017 < .05, reject
H0, suggesting that the pens do not meet
specifications.

c. Because the P-value is .045 > .01, do not
reject H0, suggesting there is no reason to
say the lifetime is inadequate.

d. Because the P-value is .0011, reject
H0. There is good evidence showing that
the pens do not meet specifications.


Do not reject H0 at .01 or .05, reject H0 at .10
b. 36.614 c. yes
a. Σx_i ≥ c b. yes


Yes, the test is UMP for the alternative
Ha: θ > .5 since the tests for H0: θ = .5 versus
Ha: θ = p0 all have the same form for p0 > .5.


b. .0502 c. .04345, .05826, no d. .05114;
not most powerful


−2ln(Λ) = 3.041, P-value = .081
75. a. .98, .85, .43, .004, .0000002
b. .40, .11, .0062, .0000003
c. Because the null hypothesis will be
rejected with high probability, even with
only slight departure from the null
hypothesis, it is not very useful to do a
.01 level test.
S a5 
25/(n-1) 
b. P-value = P(Z ≤ −3.59) ≈ .0002, so
reject H0


77. a.


79. a. 16.803 
. reject Hy because 15 is not > 16.803 
no 


. reject Ho at .10, uncertain at .01 


aoc 


81. The following R code performs the bootstrap
simulation described in this section.


mu0 = 113; N = 5000
x = c(117.6,109.5,111.6,109.2,119.1,110.8)
w = x - mean(x) + mu0
wbar = rep(0,N)   # allocating space for bootstrap means
for (i in 1:N) {
  resample = sample(w, length(w), replace=T)
  wbar[i] = mean(resample)
}


The P-value is estimated by the proportion
of these w̄ values that are at or below the
observed x̄ value of 112.9667. In one run of
this code, that proportion was .5018, so do
not reject H0.
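
The last step of Exercise 81 (the proportion of bootstrap means at or below the observed mean) can be sketched as a small reusable function. This is only an illustrative sketch: the function name boot_p_lower, its arguments, and the seed are assumptions introduced here, not part of the original solution, and the estimate will differ from run to run.

```r
# Sketch of a lower-tailed bootstrap P-value for H0: mu = mu0.
# boot_p_lower, its arguments, and the seed below are hypothetical names/choices.
boot_p_lower <- function(x, mu0, N = 5000) {
  w <- x - mean(x) + mu0    # shift the data so that H0 holds in the resampling world
  wbar <- replicate(N, mean(sample(w, length(w), replace = TRUE)))
  mean(wbar <= mean(x))     # proportion of bootstrap means at or below the observed mean
}

set.seed(81)                # arbitrary seed, only so the sketch is repeatable
x <- c(117.6, 109.5, 111.6, 109.2, 119.1, 110.8)
p_hat <- boot_p_lower(x, mu0 = 113)
p_hat                       # varies by run; the text reports .5018 for one run
```

Because the shifted data w have mean exactly mu0, the resampled means are centered at the null value, so the proportion returned estimates how unusual the observed mean would be if H0 were true.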


83. a. H0: the reformulated drug is no safer
than the original, recalled drug.
Ha: the reformulated drug is safer than
the recalled drug.

b. Type I error: The FDA rejects H0 and
concludes the new drug is safer, when in
fact it isn’t. Type II error: The FDA fails
to recognize that Ha is true, yet the new
drug is indeed safer.

c. Type I (arguably); lower α


85. Yes, only 25 required


87. t = 6.4, P-value ≈ 0, reject H0
89. a. no
b. t = .44, P-value = .33, do not reject H0


91. Assuming normality, calculate t = 1.70,
which gives a two-tailed P-value of .102. Do
not reject the null hypothesis H0: μ = 1.75.


93. The P-value for a lower tail test is .0014, so 
it is reasonable to reject the idea that p = .75 
and conclude that fewer than 75% of 
mechanics can identify the problem. 


95. Because the P-value is .013 > .01, do not 
reject the null hypothesis at the .01 level. 


97. a. For testing H0: μ = μ0 versus Ha: μ > μ0
at level α, reject H0 if 2Σx_i/μ0 > χ²_{α,2n}.
For testing H0: μ = μ0 versus Ha: μ < μ0
at level α, reject H0 if 2Σx_i/μ0 < χ²_{1−α,2n}.
For testing H0: μ = μ0 versus Ha: μ ≠ μ0
at level α, reject H0 if 2Σx_i/μ0 > χ²_{α/2,2n}
or if 2Σx_i/μ0 < χ²_{1−α/2,2n}.

b. Because Σx_i = 737, the test statistic value
is 2Σx_i/μ0 = 19.65, which gives a P-value
of .52. There is no reason to reject the
null hypothesis.


99. a. yes 
Chapter 10 


1. a. −.4 b. .0724, .269
c. Although the CLT implies that the distribution
will be approximately normal
when the sample sizes are each 100, the
distribution will not necessarily be
normal when the sample sizes are
each 10.


3. a. z = 4.84 > 1.96, reject H0
b. (1251, 2949)


5. a. Ha says that the average calorie output
for sufferers is more than 1 cal/cm²/min
below that for non-sufferers. Reject
H0 in favor of Ha because z = −2.90
< −2.33

b. .0019
c. .82, .18
d. 66
7.


17. 


19. 


21. 


23. 


29: 


27. 


29. 
31. 
33% 


35. 


37; 


.ai1l7 b. 21 


a. We must assume here that the population 
elapsed time distributions are both 
normal. 

b. z = 1.32 < 2.576, do not reject H0


. 22, no 


. b. It decreases. 


c. 18 d. 26 


H0: μ_1 − μ_2 = 0 versus Ha: μ_1 − μ_2 < 0;
t = −15.83, reject H0


a. t = −18.64 < −1.686, strongly reject H0
b. t = 15.66 > 1.680, again strongly reject H0
c. at most .10


a. (219.6, 538.4)
b. t = 2.20, P-value = .014, reject H0


a. No, mean < sd so positively skewed; no 
b. ($115, $375) 


Because t = −3.35 < −3.30 = −t_{.001,42}, yes,
there is evidence that experts do hit harder.


b. No c. Because |t| = |−.38| < 2.23 = t_{.025,10},
no, there is no evidence of a difference.


Because the one-tailed P-value is .0004 <
.01, conclude at the .01 level that the difference
is as stated. This could result in a
type I error.


Yes, because t = 2.08 with P-value = .046.
b. (127.6, 202.0) c. 131.75


Because t = 1.82 with P-value .046 < .05,
conclude at the .05 level that the difference
exceeds 1.


a. The slender distribution appears to have
a lower mean and lower variance.

b. With t = 1.88 and a P-value of .097, there
is no significant difference at the .05
level.


With t = 2.19 and a two-tailed P-value of
.031, there is a significant difference at the
.05 level but not the .01 level.


43. 


45. 


47. 


49. 


51. 
53. 
57. 


59. 
61. 
63. 


65. 


67. 
69. 


71. 




a. (x̄ − ȳ) ± t_{α/2,m+n−2} · s_p√(1/m + 1/n)
b. (455, 45) 
c. (448, 38) 


t = 3.88 > 3.365, so reject H0


a. (.000046, .000446); yes, because 0 does
not fall in the CI
b. t = 2.68, P-value = .01, reject H0


a. yes b. $10,524 c. |t| = |−1.21| < 1.729,
so do not reject H0; yes


a. two-sample t

b. t = 2.47 > 1.681, reject H0
c. paired t

d. t = −4.34 < −1.717, reject H0


b. (12.67, 25.16)
t = −2.2, P-value = .028, reject H0


a. Because |z| = |−4.84| > 1.96, conclude
that there is a difference. Rural residents
are more favorable to the increase.

b. .9967


(.016, .171)
a. (−.294, −.207)


H0: p_1 − p_2 = 0 versus Ha: p_1 − p_2 < 0,
z = −2.01, P-value = .022, reject H0


a. p_1 = the proportion of all students who
would agree to be surveyed by Melissa,
p_2 = the proportion of all students who
would agree to be surveyed by Kristine;
z = 3.00, P-value = .003, reject H0

b. No


769

a. H0: p_3 = p_2 versus Ha: p_3 > p_2

b. p̂_3 − p̂_2 = (X_3 − X_2)/n

c. (X_3 − X_2)/√(X_2 + X_3)

d. z = 2.68, P-value = .0037, reject H0 at .01
but not at .001.

a. 3.69 b. 4.82 c. .207 d. .271


e. 4.30 f. .212 g. .95 h. .94




73. 


75.
77. 
79. 


81. 


83. 


85. 


f = 1.814 < F_{.10,9,7} = 2.72, so P-value > .10
> .01, do not reject H0


f = 4.38 < F_{.01,11,9} = 5.18, do not reject H0
(0.87, 2.41) 


a. (.158, .735) 

b. Bootstrap distribution of differences 
looks quite normal. 

c. (.171, .723) for one simulation 

d. (.156, .740) for one simulation 

e. All three intervals are quite similar. 

f. Students on lifestyle floors appear to 
have a higher mean GPA, somewhere 
between ~.16 higher and ~.73 higher. 


a. (0.593, 1.246); normal probability plots 
show departures from normality, CI 
might not be valid. 

b. The R code below assumes two vectors, 
L and N, contain the original data. 


ratio = rep(0,5000)   # allocating space for bootstrap ratios
for (i in 1:5000) {
  L.resamp = sample(L, length(L), replace=T)
  N.resamp = sample(N, length(N), replace=T)
  ratio[i] = sd(L.resamp)/sd(N.resamp)
}
CI = (0.568, 1.289) for one simulation 


a. The bootstrap distribution of differences 
of medians is definitely not normal. 

b. (0.38, 10.44) for one simulation 

c. (0.4706, 10.0294) for one simulation 


a. t = 2.62, df = 17, P-value = .018, reject 
Ho at the .05 level 

b. In the R code below, the data is read is as 
a data frame called df with two columns, 
Time and Group. The first lists the times 
for each rat, while the second has B and 
C labels. 


N = 5000 
diff = rep(0,N) 
for (i in 1:N) {


87. 


89. 


91. 


95. 


99. 
  resample = sample(df$Time, length(df$Time), replace=T)
  C.resamp = resample[df$Group=="C"]
  B.resamp = resample[df$Group=="B"]
  diff[i] = mean(C.resamp) - mean(B.resamp)
}

P-value = 2(proportion of simulated differences
> 10.59 - 5.71) = .02 for one
simulation

Results of (a) and (b) are similar 
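The resampling loop for this exercise stops once the simulated differences are stored; the P-value is then the doubled proportion of simulated differences at least as large as the observed one. A hedged sketch of the full calculation, assuming a placeholder data frame with columns Time and Group (simulated here, not the rat data) and keeping the source's resampling with replacement:

```r
set.seed(2)                                    # hypothetical seed
# Placeholder data frame, NOT the exercise data: 10 "B" rats and 9 "C" rats
df <- data.frame(
  Time  = c(rnorm(10, mean = 6, sd = 3), rnorm(9, mean = 10, sd = 3)),
  Group = rep(c("B", "C"), times = c(10, 9))
)

# Observed difference in group means (C minus B)
obs <- mean(df$Time[df$Group == "C"]) - mean(df$Time[df$Group == "B"])

N <- 5000
diff <- numeric(N)
for (i in 1:N) {
  # Resample the pooled times, ignoring group labels (null hypothesis: no group effect)
  resample <- sample(df$Time, length(df$Time), replace = TRUE)
  C.resamp <- resample[df$Group == "C"]
  B.resamp <- resample[df$Group == "B"]
  diff[i] <- mean(C.resamp) - mean(B.resamp)
}

# Two-sided P-value: twice the upper-tail proportion, capped at 1
p.value <- min(1, 2 * mean(diff >= obs))
print(p.value)
```

A strict permutation test would use replace = FALSE in the call to sample; that is a different variant from the one written in the answer.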


2 


a. f = 4.46; F.95,6,5 = 0.228 < 4.46 < F.05,6,5
= 4.95, so do not reject H0 at .10 level.

b. Use code similar to Exercise 85, but change
the last line to calculate the ratio of
resampled variances. Observed ratio =
4.48, proportion of ratio values > 4.48
was .086 for one simulation, so P-value =
2(.086) = .172, and H0 is again not rejected.


Z 


a. Use the code from Exercise 85. Observed
difference = 3.47, proportion of simulated
differences > 3.47 was .019 for
one simulation, so P-value = 2(.019) =
.038. Reject H0 at the .05 level.

Results are similar; not surprising since 
both methods are valid. 


a. ($6.40, $11.85) 
b. Use code provided in Chapter 8; boot- 
strap distribution of d is not normal. 

c. ($6.44, $11.81) for one simulation 

d. ($6.23, $11.51) for one simulation 

e. (a) and (c) are similar, while (d) is shifted 
to the left; (d) is most trustworthy 

f. On average, books cost between $6.23
and $11.51 more with Amazon than at
the campus bookstore!




The difference is significant at the .05, .01, 
and .001 levels. 


b. No, given that the 95% CI includes 0, the 
test at the .05 level does not reject 
equality of means. 


101. (-299.2, 1517.8)


105. 


109. 


111. 


113. 


115. 


117. 


119. 


121. 


(1020.2, 1339.9). Because 0 is not in the
CI, we would reject equality of means at
the .01 level.


Because t = 2.61 and the one-tailed
P-value is .007, the difference is significant
at the .05 level using either a one-tailed or
a two-tailed test.


a. μ1 = true mean AEDI score improvement
for all 2001 students; H0: μ1 = 0
versus Ha: μ1 > 0; t = 2.41, df = 36,
P-value = .011, reject H0.

b. Similarly, t = 2.19, P-value = .020,
reject H0.

c. H0: μ1 - μ2 = 0 versus Ha: μ1 - μ2 < 0,
t = -0.23, P-value = .411, do not reject
H0. The data does not suggest an
“Enron effect.”


Because t = 7.50 and the one-tailed 
P-value is .0000001, the difference is 
highly significant, assuming normality. 


The two-sample t test is inappropriate for
paired data. The paired t gives a mean
difference .3, t = 2.67, and the two-tailed
P-value is .045, so the means are significantly
different at the .05 level. We are
concluding tentatively that the label
understates the alcohol percentage.


Because the paired t = 3.88 and the two-tailed
P-value is .008, the difference is
significant at the .05 and .01 levels, but not
at the .001 level.


a. t = 11.86 > 2.33, reject H0 at .01 level.

b. t = 8.99, again clearly reject H0.

c. Yes, because students were randomly 
assigned to experimental groups. 


.902, .826, .029, .00000003 


Because z = 4.25 and the one-tailed 
P-value is .00001, the difference is highly 
significant and companies do discriminate. 


With Z = (X̄ - Ȳ)/sqrt(X̄/n + Ȳ/m), the
result is z = -5.33, two-tailed P-value =
.0000001, so one should conclude that there
is a significant difference in parameters.
Chapter 11


1. 


11. 


17. 


19. 


a. Reject H0: μ1 = μ2 = μ3 = μ4 = μ5 in
favor of Ha: μ1, μ2, μ3, μ4, μ5 not all the
same, because f = 5.57 > 2.69 = F.05,4,30.


b. Using Table A.8, .001 < P-value < .01. 


(The P-value is .0018) 


SSTr = 2304, SSE = 4200, f = 5.76 >
F.05,2,21 = 3.47, reject H0


a. SSTr = 9982.4, MSTr = 1109.16


b. Source   df   SS       MS        f
   Brand     9   9982.4   1109.16   8.53
   Error    30   3900.0    130.00
   Total    39


8.53 > F.01,9,30 = 3.07, so reject H0.


Source df SS MS f 
Type 3 127375 42458 25.09 
Error 20 33839 1692 

Total 23 161214 


P-value ≈ .000, so reject H0.


a. SSTr = 270, MSTr = 90, SSE = 17446,
MSE = 167.75

b. f = 0.54 < F.05,3,104 ≈ 2.69, so H0 is not
rejected at the .05 level.


b. H0: μ1 = μ2 = μ3 = μ4 = μ5 vs Ha: not all
μ's are equal, f = [18797.5/4] / [251.5/15]
= 4699.38/16.77 = 280.28, strongly
reject H0.
Q.05,5,15 = 4.37, and 4.37 sqrt(272.8/4) =
36.09
3      1      4      2      5
437.5  462.0  469.3  512.8  532.1
3      1      4      2      5
427.5  462.0  469.3  502.8  532.1
4       3       1       2
562.02  698.07  713.00  756.93


(6.401, 10.589)
23. a. SOO BSP CW SWE BMIPS ST BSF N_ GMIPS GS


117— 122,127,129 141 142, 144 147 148 


b. (-16.79, -0.73)
25. 422.16 < SSE < 431.88 


27. SSTr = 465.5, SSE = 124.5, f = 17.12 >
F.05,3,14 = 3.34, so reject H0 at .05 level.


29. a. With large sample sizes, normality is less
important. 11.32/9.13 < 2 indicates equal
variances is plausible.

b. SSTr = 2445.7, SSE = 118,632.6, f =
20.92 > F.05,4,1015 = 2.38, so H0 is
rejected.

c. Q.05,5,1015 ≈ 3.86, d_ij = 3.86 sqrt((MSE/2)(1/J_i + 1/J_j))

Graduate  Freshman  Sophomore  Junior  Senior
45.55     48.95     51.45      52.89   52.92


31. μi = true mean impact of social media, as a
percentage of sales, for the ith category;
SSTr = 3804, SSE = 76973, f = 8.85 >
F.01,2,358 ≈ 4.66, so H0 is rejected at the .01
significance level.


33. a. The distributions of the polyunsaturated
fat percentages for each of the four regimens
must be normal with equal
variances.

b. SSTr = 8.334, SSE = 77.79, f = 1.714 <
F.10,3,50 = 2.20, so P-value > .10 and H0
is not rejected.


35. μi = true mean change in CMS under the ith
treatment; SSTr = 19.84, SSE = 16867.6,
f = 0.1129 < F.05,2,96 = 3.09, so H0 is not
rejected at the .05 level.
37. When H0 is true, all the αi's are 0, and
E(MSTr) = σ². Otherwise, E(MSTr) > σ².

39. λ = 10, F.05,3,14 = 3.344. From R, β =
pf(3.344,df1=3,df2=14,ncp=10) =
.372, and so power = 1 - β = 1 - .372 =
.628.
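The power calculation in Exercise 39 can be reproduced in a few lines. This is a sketch under the values stated in the answer (noncentrality parameter 10, numerator df 3, denominator df 14, significance level .05); the variable names are illustrative only.

```r
lambda <- 10           # noncentrality parameter from the answer
df1 <- 3               # numerator degrees of freedom
df2 <- 14              # denominator degrees of freedom

# Critical value F.05,3,14
f.crit <- qf(0.95, df1, df2)

# beta = P(F < critical value) under the noncentral F distribution
beta <- pf(f.crit, df1 = df1, df2 = df2, ncp = lambda)
power <- 1 - beta

print(round(c(f.crit = f.crit, beta = beta, power = power), 3))
```

Computing the critical value with qf rather than typing 3.344 avoids rounding the cutoff before it is passed to pf.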

41. a. The sample standard deviations are very 
different. 


43. 


45. 


47. 


49. 




b. For the transformed data, ȳ.. = 2.46, SSTr
= 26.104, SSE = 7.748, f = 70.752, so H0
is clearly rejected.


c. PW 
1.30 


h(x) = arcsin(sqrt(x/n))


a. MSA = 7.65, MSE = 4.93, fA = 1.55.
Since 1.55 < F.05,4,12 = 3.26, don’t reject
H0A.

b. MSB = 14.70, fB = 2.98 < F.05,3,12 =
3.49, don’t reject H0B.


MK 
2.28 


NW 
2.62 


50K FR 
2.75 2.82 


a. Source   df    SS          MS          f
   Method    5     596,748   119,349.6   9.67
   Block    16     529,100     3306.9    0.27
   Error    80     987,380    12342.3
   Total   101   2,113,228

b. H0: α1 = ... = α6 = 0 versus Ha: not all
α's are 0. Since 9.67 > F.01,5,80 = 3.255,
we reject H0 at the .01 level.

c. I = 6, J = 17, MSE = 12342.3, Q.01,6,80 ≈
4.93, HSD = 4.93 sqrt(12342.3/17) = 132.8.

Wrist acc.  Hip acc.  Pedometer  Hip + LFE  Wrist + LFE  Hand tally
449         466       557        579        606          668

a. Source          df   SS      MS       f       P-value
   Spindle speed    2   16106   8052.8   10.47   0.026
   Feed rate        2    2156   1077.8    1.40   0.346
   Error            4    3078    769.4
   Total            8   21339


b. The test statistic value and P-value for
H0A: α1 = α2 = α3 = 0 versus HaA: not
all α's = 0 are f = 10.47 and P = .026.
Since .026 < .05, we reject H0A at the
.05 level and conclude that mean temperature
varies with spindle speed.

c. The test statistic and P-value for
H0B: β1 = β2 = β3 = 0 versus HaB: not all
β's = 0 are f = 1.40 and P = .346. Since .346
> .05, do not reject H0B at the .05 level;
conclude that feed rate has no statistically
significant effect on mean temperature.
53.


55.


57. 


61. 


Source            df    SS        MS         f
Flavor              2    20,797   10,398.5   35.0
Block (Subject)    53   135,833    2,562.9    8.6
Error             106    31,506      297.2
Total             161   188,136
With f = 35.0 ≥ F.01,2,106 ≈ 4.81,
H0A: α1 = α2 = α3 = 0 is rejected at the
.01 level.
d. Yes 
Source df SS MS f P-value 
Current 2 106.78 53.39 0.19 0.833 
Voltage 2 56.05 28.03 0.10 0.907 
Error 4 1115.75 278.94 
Total 8 1278.58 


According to the ANOVA table, neither 
factor has a statistically significant effect at 
the .10 level: both P-values are > .10. 


With f = 8.69 > 6.01 = F.01,2,18, there are
significant differences among the three
treatment means. The normal plot of residuals
shows no reason to doubt normality,
and the plot of residuals against the fitted
values shows no reason to doubt constant
variance. There is no significant difference
between treatments B and C, but Treatment
A differs (it is lower) significantly from the
others at the .01 level.


Because f = 8.87 > 7.01 = Fo 4, reject the 
hypothesis that the variance for B is 0. 


a.

Source        df   SS         MS        F
A              2   30763      15381.5   3.79
B              3   34185.6    11395.2   2.81
Interaction    6   43581.2     7263.5   1.79
Error         24   97436.8     4059.9

Total         35   205966.6


b. Because 1.79 < 2.51 = F.05,6,24, there is
no significant interaction.

c. Because 3.79 > 3.40 = F.05,2,24, there is a
significant difference among the A means
at the .05 level.


63. 


65. 


67. 
d. Because 2.81 < 3.01 = F.05,3,24, there is
no significant difference among the B
means at the .05 level.

e. Using d = 64.93,


3 1 2 
3960.2 4010.88 4029.10 


a. With f = 1.55 < 2.81 = F.10,2,12, there is
no significant interaction at the .10 level.

b. With f = 376.27 > 18.64 = F.001,1,12,
there is a significant difference between
the formulation means at the .001 level.
With f = 19.27 > 12.97 = F.001,2,12, there
is a significant difference among the
speed means at the .001 level.

c. Main effects Formulation: (1) 11.19,
(2) -11.19 Speed: (60) 1.99, (70) -5.03,
(80) 3.04


a. Factor #1 = firing distance, levels = 25 yd,
50 yd. Factor #2 = bullet brand, levels =
Federal, Remington, Winchester. Treatments:
(25, Fed), (25, Rem), (25, Win),
(50, Fed), (50, Rem), and (50, Win).


b. The interaction plot suggests a huge
distance effect. There appears to be very
little bullet manufacturer effect. The nonparallel
pattern suggests perhaps a slight
interaction effect.


Source             df    SS        MS        f        P-value
Distance             1    568.97   568.969   242.56   0.000
Bullet               2      2.97     1.487     0.63   0.531
Distance * bullet    2      2.48     1.242     0.53   0.589
Error              444   1041.49     2.346
Total              449   1615.92

Source        DF   SS        MS        F
pen            3   1387.5     462.50   0.34
surface        2   2888.1    1444.04   1.07
Interaction    6   8100.3    1350.04   1.97
Error         12   8216.0     684.67
Total         23   20591.8
69.


73.


75.


With f = 1.97 < 2.33 = F.10,6,12, there is no
significant interaction at the .10 level.
With f = .34 < 3.29 = F.10,3,6, there is no
significant difference among the pen means
at the .10 level.

With f = 1.07 < 3.46 = F.10,2,6, there is no
significant difference among the surface
means at the .10 level.


Source                   df   Adj SS   Adj MS   F        P-value
Distance                  2   562424   281212   360.70   0.000
Temperature               2    11757     5879     7.54   0.004
Distance * Temperature    4    21715     5429     6.96   0.001
Error                    18    14033      780
Total                    26   609930


The ANOVA table indicates a highly
statistically significant interaction effect
(f = 6.96, P-value = .001). The interaction
by itself indicates that both nozzle-bed distance
and temperature play a significant role
in determining strut width. Apply Tukey’s
method here to the nine (distance, temperature)
pairs to identify honestly significant
differences.


Distance*Temperature   N   Mean      Grouping
0.2  220               3   935.000   A
0.2  180               3   860.000   A B
0.2  200               3   806.667     B
0.3  200               3   676.667       C
0.3  220               3   643.333       C
0.3  180               3   610.000       C D
0.4  220               3   538.333         D E
0.4  200               3   511.667           E
0.4  180               3   505.000           E

a. MSAB/MSE b. MSA/MSAB, MSB/MSAB

Source      df   SS         MS        f
Treatment    3    81.1944   27.0648   22.36
Block        8    66.5000    8.3125    6.87
Error       24    29.0556    1.2106

Total       35   176.7500


77. 


79. 


81. 


83. 
Since 22.36 > F.05,3,24 = 3.01, reject H0A.
There is an effect due to treatments. Next,


Q.05,4,24 = 3.90, so Tukey’s HSD is
3.90 sqrt(1.2106/9) = 1.43.

1     4     3      2
8.56  9.22  10.78  12.44
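The honestly significant difference quoted here can be checked numerically, along with which pairs of treatment means it separates. A minimal sketch using the MSE, block count, and sample means from the answer; the label of the third mean is garbled in the source and is assumed to be treatment 3.

```r
mse <- 1.2106          # error mean square from the ANOVA table
J <- 9                 # observations per treatment (number of blocks)

# Studentized range critical value Q.05,4,24
q.crit <- qtukey(0.95, nmeans = 4, df = 24)
hsd <- q.crit * sqrt(mse / J)

# Sample means in increasing order; treatment label "3" is an assumption
means <- c(8.56, 9.22, 10.78, 12.44)
names(means) <- c("1", "4", "3", "2")

# Absolute pairwise differences, compared against the HSD
diffs <- abs(outer(means, means, "-"))
print(round(hsd, 2))
print(diffs > hsd)
```

Any pair flagged TRUE differs by more than the HSD and is declared significantly different by Tukey's method.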


a. H0: μ1 = μ2 = μ3 = μ4 vs. Ha: at least
two of the μi’s are different; f = 3.68 <
F.01,3,20 = 4.94, thus fail to reject H0. The
means do not appear to differ.

b. We reject H0 when the P-value < α.
Since .029 > .01, we still fail to reject H0.


SSTr = 6.172, SSE = 1045.75, f =
3.086/18.674 = 0.165 < F.05,2,56 = 3.16, so
do not reject H0.

Source   DF   SS      MS     F
Diet      4    .929   .232   2.15
Error    25   2.690   .108

Total    29   3.619


Because f = 2.15 < 2.76 = F.05,4,25, there is
no significant difference among the diet
means at the .05 level.


b. (-.144, .474) Yes, the interval includes 0.

c. .53 

a. SSTr = 19812.6, SSE = 1083126, f =
9906.3/7125.8 = 1.30 < F.05,2,152 = 3.06,
and H0 is not rejected.


b. Q.05,3,152 ≈ 3.347 and d_ij =
3.347 sqrt((7125.8/2)(1/J_i + 1/J_j)) for each pair. Then
d12 = 42.1, d13 = 37.2, and d23 = 40.7.
None of the sample means are nearly this
far apart, so Tukey’s method provides no
statistically significant differences. This is
consistent with the results in part (a).
Chapter 12


1. 


a. Both the BMI and peak foot pressure 
distributions appear positively skewed 


with some gaps and possible high 
outliers. 
Stem-and-leaf of BMI        Stem-and-leaf of Foot pressure
  1   12  8                    7    3  0012344
  6   13  00588               16    3  566666678
 13   14  2456689             18    4  11
 19   15  000569              (8)   4  56789999
 21   16  69                  16    5  34
 21   17  01156               14    5  577778
 16   18  677                  8    6  024
 13   19                       5    6  6
 13   20  0156                 4    7  4
  9   21  0126                 3    7
  5   22  4                    3    8  1
  4   23  1                    2    8  59
  3   24  27
  1   25
  1   26  5
Leaf Unit = 0.1             Leaf Unit = 10
b. No 


c. The scatterplot suggests some positive 
association between BMI and peak foot 
pressure, but the relationship does not 
appear to be very strong, and there are 
many outliers from the overall pattern. 


3. Yes. 


5.


11. 


13. 


b. Yes 
c. The relationship of y to x is roughly 
quadratic. 


48.75 mpg
-.0085 mpg
-4.25 mpg
4.25 mpg


a. .095 m³/min b. -.475 m³/min
c. .83 m³/min, 1.305 m³/min
d. .4207, .3446 e. .0036


a. -.01 h, -.10 h b. 3.0 h, 2.5 h c. .3653
d. .4624


a. y= .63 + .652x 
b. 23.46, -2.46 


ae Pp 


15. 


17. 


19. 


21. 


25. 


29. 
31. 


33. 


35. 


37. 




c. 392, 5.72 
d. 95.6% 

e. y = 2.29 + .564x, R² = 68.8%
a. y = 14.6497 + .09092x 

b. 1.8997 

c. -.9977, -.0877, .0423, .7823 
d. 42% 


a. Yes 
b. slope, .827; intercept, —1.13 
c. 40.22 

d. 5.24 

e. 97.5% 


a. y = 75.212 — .20939x, 54.274 

b. 79.1% 

c. 2.56 

b. y = 0.398 + 3.080x 

c. A 1-cm increase in palpebral fissure
width corresponds to an estimated 3.080
cm² increase in average/expected OSA.

d. 3.452 cm²

e. 3.452 cm²


new slope = 1.82, new intercept = 
LS py 32 

β̂0* = Ȳ and β̂1* = β̂1

a. .0756 


b. .813 
c. The n = 7 sample is preferable (larger Sxx).


H0: β1 = 0 versus Ha: β1 ≠ 0, t = 22.64,
P-value ≈ 0, so there is a useful linear
relationship.

CI = (.748, .906) 


a. β̂1 = 1.536, and a 95% CI is (.632, 2.440)

b. Yes, for the test of H0: β1 = 0 versus
Ha: β1 ≠ 0, we find t = 3.62, with
P-value .0025. At the .01 level conclude
that there is a useful linear relationship.

c. Because 5 is beyond the range of the 
data, predicting at a dose of 5 might 
involve too much extrapolation. 

d. The observation does not seem to be 
exerting undue influence. 

a. Yes, for the test of H0: β1 = 0 versus
Ha: β1 ≠ 0, we find t = -6.73, with
43.


45. 


47. 


49. 


51. 


55. 


57. 


59. 


io” 


aandaeep 


ona FT 


P-value < 10 °. At the .01 level conclude 
that there is a useful linear relationship. 


2.77, VAD) 


. 600 is closer to x̄ = 613.5 than is 750
. (2.258, 3.188) 

. (1.336, 4.110) 

. at least 90% 


y = -1.5846 + 2.58494x, 83.73% 


. (2.16, 3.01) 

. (0.125, 0.058) 

. (0.559, 0.491) 

. H0: μY·7 = 0 versus Ha: μY·7 ≠ 0; reject
H0 because 0 is not in the confidence
interval (0.125, 0.325) for μY·7


(86.3, 123.5) 


a. 


ono St 


—s 


t = 4.88, P-value ≈ 0, so reject H0. A
useful relationship exists.


. (64.2, 161.3) 

. (10228, 11362) 

. (8215, 13379) 

. Wider, because 85 is farther from the 


mean x-value of 68.65 than is 70. 


. No, extrapolation 


g. (8707, 10633); (10020, 11574); 


(10845, 13004) 
Yes 


. t = 6.45, P-value = .003, reject H0. A
useful relationship exists.


. (.05191, .11544)
. (.00048, .16687)


. For the test of H0: ρ = 0 versus Ha: ρ > 0,
we find r = .7482, t = 3.91, with P-value
< .05. At the .05 level conclude that there
is a positive correlation.


R= .56; it is the same no matter which 


variable is the predictor. 


. t = 1.74 < 2.179, so do not reject
H0: ρ = 0.


. R² = 20%


. (.829, .914)
. z = 2.412, P-value = .008, so reject
H0: ρ = 0


61. 


65. 


67. 


69. 


71. 
c. R² = 77.1%
d. Still 77.1% 


a. Reject the null hypothesis in favor of the 
alternative. 

b. No, with a large sample size a small r can 
be significant. 

c. Because t = 2.200 > 1.96 = t.025,9998 the
correlation is statistically (but not necessarily
practically) significant at the .05
level.


a. .184, —.238, —.426 
b. The mean that is subtracted is not the
mean x̄(1,n-1) of x1, x2, ..., x(n-1), or the
mean x̄(2,n) of x2, x3, ..., xn. Also, the
denominator of r1 is not
sqrt(Σ(xi - x̄(1,n-1))² Σ(xi - x̄(2,n))²).
However, if n is large then r1 is
approximately the same as the correlation.
A similar relationship applies to r2.

c. No

d. After performing one test at the .05 level, 
doing more tests raises the probability of 
at least one type I error to more than .05. 




The plot suggests that the regression model 
assumptions of linearity/model adequacy and 
constant error variance are both plausible. 


a. The plot does not show curvature, but 
equal variance is not satisfied. 

b. The standardized residual plot is similar 
to (a). The normality plot suggests nor- 
mality of the true errors is plausible. 


a. For testing H0: β1 = 0 versus Ha: β1 ≠ 0,
t = 10.97, with P-value .0004. At the
.001 level conclude that there is a useful
linear relationship.

b. The residual plot shows curvature, so the 
linear relationship of part (a) is 
questionable. 

c. There are no extreme standardized
residuals, and the plot of standardized 
residuals is similar to the plot of ordinary 
residuals. 




Cc. 


. The plot indicates there are no outliers, 
but there appears to be higher variance 
for middle values of filtration rate. 

. ei/ei*’s range between .57 and .65, which
are close to se.

Similar to the plot in (a). 


75. The first data set seems appropriate for a
straight-line model. The second data set
shows a quadratic relationship, so the
straight-line relationship is inappropriate.
The third data set is linear except for an
outlier, and removal of the outlier will allow
a line to be fitted. The fourth data set has
only two values of x, so there is no way to
tell if the relationship is linear.


77.


79.


81.
83.


85.


87.


89.


a 


b. 


b. 
c. 


fet) 


p 


es 


io” 


. $24,000 
$16,300 


9.193 

f = 82.75, P-value ~ 0, so at least one of 
the four predictors is useful, but not 
necessarily all four. 

Compare each to foo12595 = 
Predictors 1, 2, and 4 are useful. 


3.106. 


b. y = -77 + 4.397x1 + 165x2 c. $2,781,500


R² = 34.05%, se = 0.967

. H0: β1 = β2 = β3 = 0 versus Ha: not all
three β’s are 0; f = 2.065 < 3.49, so do
not reject H0 at the .05 level

Yes 


y = 148 - 133x1 + 128.5x2 + 0.0351x3

. For x1: t = -0.26, P = .798. For x2:
t = 9.43, P < .0001. For x3: t = 1.42,
P = .171. Only x2 is a statistically significant
predictor of y.

. R² = 86.33%, adjusted R² = 84.17%

R² = 86.28%, adjusted R² = 84.91%


f = 87.6, P-value ≈ 0, strongly reject H0
R? = 93.5% 
. (9.095, 11.087) 


Always decreases 
. 61.432 GPa 
. f = 227.88 > 5.568, so Ho is rejected. 
d. (55.458, 67.406)

e. (53.717, 69.147) 

91. a. Both plots exhibit curvature. 

b. No 

c. The plots suggest that all model 
assumptions are satisfied. 

d. All second-order terms should be 
retained. 

93. a. H0: β1 = β2 = 0 versus Ha: not both
β1 and β2 are 0, f = 22.91 > F.05,2,9 =
4.26, so reject H0. Yes, there is a useful
relationship.

b. H0: β2 = 0 versus Ha: β2 ≠ 0, t = 4.01,
P-value < .005, reject H0. Yes.

c. (.5443, 1.9557)

d. (2.91, 6.29) 

95. a. The quadratic terms are important in 
providing a good fit to the data. 

b. A 95% PI is (.560, .771). 

97. a. ray = .843 (P-value = .000), rra = .621 
(.001), ry4 = .843 (.000) 

b. Rating = 2.24+ 0.0419 IBU—0.166 ABV. 
Because the predictors are highly corre- 
lated, one is redundant. 

c. Linearity is an issue. 

e. The regression is quite effective, with R²
= 87.2%. The ABV coefficient is not
significant, so ABV is not needed. The 
highly significant positive coefficient for 
IBU and negative coefficient for its 
square show that Rating increases with 
IBU, but the rate of increase is lower at 
higher IBU. 

99. a. X = [ 1  -1  -1 ]     y = [ 1 ]
          [ 1  -1   1 ]         [ 1 ]
          [ 1   1  -1 ]         [ 0 ]
          [ 1   1   1 ]         [ 4 ]

      X'X = [ 4  0  0 ]     X'y = [ 6 ]
            [ 0  4  0 ]           [ 2 ]
            [ 0  0  4 ]           [ 4 ]

   b. b = [ 1.5 ]
          [  .5 ]
          [ 1   ]
101.


105. 


109. 




c. SSE = 4, MSE = 4
d. (-12.2, 13.2)
e. For the test of H0: β1 = 0 versus
Ha: β1 ≠ 0, we find |t| = .5 < t.025,1 = 12.7,
so do not reject H0 at the .05 level. The x1
term does not play a significant role.


f. Source       DF   SS   MS    F
   Regression    2    5   2.5   .625
   Error         1    4   4.0
   Total         3    9


With f = .625 < 199.5 = F.05,2,1, there is no
significant relationship at the .05 level.
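The matrix arithmetic behind Exercise 99 is small enough to verify directly. The sketch below assumes the design matrix has an intercept column and two columns coded -1/+1, and that the response vector is (1, 1, 0, 4); the third response value is garbled in the source and is inferred here because it is the only value that reproduces X'y = (6, 2, 4).

```r
# Design matrix: intercept plus two -1/+1 coded predictors (assumed layout)
X <- cbind(1, c(-1, -1, 1, 1), c(-1, 1, -1, 1))
y <- c(1, 1, 0, 4)     # third entry inferred from X'y, see the note above

XtX <- t(X) %*% X      # should be 4 times the identity matrix
Xty <- t(X) %*% y      # should be (6, 2, 4)
b <- solve(XtX, Xty)   # least squares estimates

fitted <- X %*% b
SSE <- sum((y - fitted)^2)
MSE <- SSE / (nrow(X) - ncol(X))       # error df = 4 - 3 = 1
SSR <- sum((fitted - mean(y))^2)
f <- (SSR / 2) / MSE                   # model utility F statistic

# Standard error and t ratio for the first slope
se.b1 <- sqrt(MSE * solve(XtX)[2, 2])
t1 <- b[2] / se.b1

print(list(b = as.vector(b), SSE = SSE, MSE = MSE, f = f, t1 = t1))
```

With only one error degree of freedom the critical values are enormous (12.7 for t, 199.5 for F), which is why neither test rejects despite nonzero estimates.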


a. X'X = [ n    Σxi  ]
         [ Σxi  Σxi² ]

   (X'X)^-1 = 1/(nΣxi² - (Σxi)²) [  Σxi²  -Σxi ]
                                 [ -Σxi    n   ]

b. X'y = [ Σyi   ]
         [ Σxiyi ]

   β̂ = [ ȳ - (Sxy/Sxx)x̄ ]
       [ Sxy/Sxx        ]


B=), &=YCO-y/m- 0), 


yt toosn-18//n 


sil m+n 
0 ~ m+n 1 


B 
1 mei = — Je 
b. y = [v1. .-Yiy2-. ¥2] 
c. SSE = Σ(y1i - ȳ1)² + Σ(y2i - ȳ2)²,
se = sqrt(SSE/(m + n - 2)),
s(β̂1) = se sqrt(1/m + 1/n)
d. β̂0 = 128.166, β̂1 = -14.333,
ŷ = [121, 121, 121, 135.33, 135.33, 135.33],
SSE = 116.666, se = 5.4,
95% CI for β1: (-26.58, -2.09)


a. With f = 12.04 > 9.55 = F.01,2,7, there is a
significant relationship at the .01 level.
To test H0: β1 = 0 versus Ha: β1 ≠ 0, we
find |t| = 2.96 > t.025,7 = 2.36, so reject


w=y, Br=2>dT yi 


/ 
* 


111. 


113. 


115. 


117. 


119. 
H0 at the .05 level. The foot term is
needed.

To test H0: β2 = 0 versus Ha: β2 ≠ 0, we
find |t| = 0.02 < t.025,7 = 2.36, so do not
reject H0 at the .05 level. The height
term is not needed.


. The highest leverage is .88 for the fifth 


point. The height for this student is 
given as 54 inches, too low to be correct 
for this group of students. Also this 
value differs by 8" from the wingspan, 
an extreme difference. 


. Point 1 has leverage .55, and this student
has height 75, foot length 13, both
quite high.

Point 2 has leverage .31, and this student
has height 66 and foot length 8.5,
at the low end.

Point 7 has leverage .31 and this student
has both height and foot length at
the high end.


. Point 2 has the most extreme residual. This 


student has a height of 66” and a wingspan 
of 56” differing by 10”, so the extremely 
low wingspan is probably wrong. 


. For this data set it would make sense to 


eliminate points 2 and 5 because they 
seem to be wrong. However, outliers 
are not always mistakes and one needs 
to be careful about eliminating them. 


. p(10) = .060, p(50) = .777 
. odds(10) = .0639, odds(50) = 3.49 
. $37.50 


. H0: β1 = 0 versus Ha: β1 ≠ 0,
z = -2.026, H0 is rejected at α = .05


. (.675, .993) 
. β̂0 = -.0573 and β̂1 = .00430


c. H0: β1 = 0 versus Ha: β1 ≠ 0, z = 0.74,
H0 is not rejected.
912 

.794 

484 


. z1 = 0.38, z2 = 1.75, so do not reject
H0: β1 = 0 but do reject H0: β2 = 0 in
favor of Ha: β2 ≠ 0.
121.


123. 


125. 


127. 


129. 


133. 


Flood damage increases with flood
level, but there are two “jumps” at 2-3
ft and 5-6 ft.
No


a. 50.73%
b. .7122
c. To test H0: β1 = 0 versus Ha: β1 ≠ 0, we
have t = 3.93, with P-value .0013. At
the .01 level conclude that there is a
useful linear relationship.
d. (1.056, 1.275)
e. ŷ = 1.014, y - ŷ = -.214


No, if the relationship of y to x is linear,
then the relationship of y² to x is quadratic.


a. 


mo aogp 


Yes 

. ŷ = 98.293, y - ŷ = .117
se = .155
R² = .794


. 95% CI for β1: (.0613, .0901)


The new observation is an outlier, and
has a major impact:

The equation of the line changes from
y = 97.50 + .0757x to y = 97.28 + .1603x
se changes from .155 to .291

R² changes from .794 to .616

The paired t procedure gives t = 3.54
with a two-tailed P-value of .002, so at
the .01 level we reject the hypothesis of
equal means.


. The regression line is y = 4.79 + .743x,
and the test of H0: β1 = 0 vs Ha: β1 ≠ 0
gives t = 7.41 with a P-value of
<.000001, so there is a significant
relationship. However, prediction is not
perfect, with R² = .753, so one variable
accounts for only 75% of the variability
in the other.


. linear 
. After fitting a line to the data, the 


residuals show a lot of curvature. 


. Yes, ln(y) = 3.1564 + 0.004811x,


α̂ = 23.486, β̂ = 0.004811


. (54.42, 112.36) 


135. 


137. 


QO 


a. 


_y= 
. A linear relationship is plausible.
. y = 31.04 - 5.79x; model utility t =
-4.25, P-value ≈ 0, so pH is a statistically
useful predictor of mean crown
dieback.


. PI = (1.42, 14.33), CI = (6.42, 9.32)
. PI = (4.69, 18.00), CI = (9.18, 13.52)

y = 84.82 + .1643x1 - 79.67x2 and R² =
.654


. R² = .831 with interaction, .7207 for
full second-order model. The model
with an interaction term but without
quadratic terms is preferred.

6.22 + 5.779x1 + 51.33x2 -
9.357x1x2, 39.32 MPa


. First-order: R² = 66.22%; with interaction,
R² = 68.27%; full second-order:
R² = 70.42%. These suggest that the
full second-order model is “best” for
predicting adsorbability.


Chapter 13 


1. a. reject Ho b. do not reject Ho 
c. do not reject Ho d. do not reject Ho 


3. Do not reject H0 because χ² = 1.57 < 7.815 = χ².05,3.


5. Because χ² = 6.61 with P-value .68, do not
reject H0.


7. Do not reject H0 because χ² = 4.41 < 7.779 = χ².10,5-1.


9. a.


b. 


11. a. 


[0, .223), [.223, .510), [.510, .916),
[.916, 1.609), [1.609, ∞)

Because χ² = 1.25 with P-value >.10, do
not reject H0.


(-∞, -.967), [-.967, -.431), [-.431, 0),
[0, .431), [.431, .967), [.967, ∞)


. (-∞, .49806), [.49806, .49914), [.49914,
.50), [.50, .50086), [.50086, .50194),
[.50194, ∞)


. Because χ² = 5.53 with P-value >.10, do
not reject H0.




13. 


15. 


17. 


19. 


21. 


23. 


25. 


27. 


29. 


31. 


a. With θ = P(male), θ̂ = .504, χ² = 3.45,
df = 4 - 1 - 1 = 2. Do not reject H0
because χ² < χ².05,2 = 5.992.

b. No, because the expected count for the 
last category is too small. 


μ̂ = 3.167 which gives χ² = 103.9 with
P-value < .001, so reject the assumption of a
Poisson model.


The observed test statistic value is χ² =
6.668 < 10.645 (df = 9 - 1 - 2 = 6), so H0 is
not rejected at the .10 level.


χ² = 2.788 = z²; P-values are the same.
Reject H0 at the .10 level but not at the .05
level.


a. χ² = 4.504 < χ².05,4 = 9.488, so H0 is not
rejected.


χ² = 9.858 < χ².05,6 = 12.592, so H0 is not
rejected at the .05 level.


a. Reject H0 because χ² = 11.954 at 2 df
and P-value = .003.

b. Very large sample sizes make the test 
capable of detecting even slight devia- 
tions from Ho. 

ê_ijk = n_i.. n_.j. n_..k / n²

df = IJK - (I + J + K) + 2 = 28.


a. Because .6806 < χ².10,2 = 4.605, H0 is not
rejected.

b. Now χ² = 6.806 > 4.605, and H0 is
rejected.

c. 677


a. With χ² = 6.45 and P-value .040, reject
independence at the .05 level.

b. With z = —2.29 and P-value .022, reject 
independence at the .05 level. 

c. Because the logistic regression takes into 
account the order in the professorial 
ranks, it should be more sensitive, so it 
should give a lower P-value. 

d. There are few female professors but
many assistant professors, and the


33. 


35. 


37. 


39. 


41. 


43. 
assistant professors will be the professors
of the future. 


χ² = 5.934, df = 2, P-value = .059. So, at the
.05 level, we (barely) fail to reject H0.


χ² = 29.775 > χ².05,(4-1)(3-1) = 12.592. So,
H0 is rejected at the .05 level.


a. H0: The population proportion of Late
Game Leader Wins is the same for all
four sports; Ha: The proportion of Late
Game Leader Wins is not the same for all
four sports. With χ² = 10.518 > 7.815 =
χ².05,3, reject the null hypothesis at the .05
level. Sports differ in terms of coming
from behind late in the game.

b. Yes (baseball) 


χ² = 881.36, df = 16, P-value is effectively
zero. USA respondents were more amenable 
to torture than the Europeans, while South 
Korean respondents were vastly more likely 
than anyone else to say it’s “sometimes” 
okay to torture terror suspects. 


a. No, χ² = 9.02 > 7.815 = χ².05,3.
b. With χ² = .157 < 6.251 = χ².10,3, there is
no reason to say the model does not fit.


a. H0: p0 = p1 = ... = p9 = .10 versus
Ha: at least one pi ≠ .10, with df = 9.

b. H0: pij = .01 for i and j = 0, 1, 2, ..., 9
versus Ha: at least one pij ≠ .01, with df
= 99.

c. No, there must be more observations 
than cells to do a valid chi-square test. 

d. The results give no reason to reject 
randomness. 


Chapter 14 


1. 
3.
5.


(va, Yis) = ($55,000, $61,000) 
(Y14, Yo7) 


P-value = 1 - B(18; 25, .5) = .007, so reject
H0 at the .05 level.
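The sign-test P-values in this chapter are upper-tail binomial probabilities with success probability .5, where B(x; n, p) denotes the binomial cdf. A short sketch of how they might be computed in R; the variable names are illustrative only.

```r
# P-value = 1 - B(18; 25, .5), i.e. P(X >= 19) when X ~ Bin(25, .5)
p1 <- 1 - pbinom(18, size = 25, prob = 0.5)

# Same form for the n = 24 case quoted later: 1 - B(16; 24, .5) = P(X >= 17)
p2 <- 1 - pbinom(16, size = 24, prob = 0.5)

print(round(c(p1 = p1, p2 = p2), 3))
```

The same values come from pbinom(18, 25, 0.5, lower.tail = FALSE), which avoids the subtraction.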
7.


11. 


13. 


17. 


19. 


21. 
23. 
25. 


P-value = 1 - B(16; 24, .5) = .032, so reject
H0 at the .05 level.


Assuming distribution of differences is
symmetric, let p = the true proportion of
individuals who would perceive a longer
time for the shorter exam (positive difference)
in this experiment. Hypotheses are
equivalent to H0: p = .5 versus Ha: p > .5.
P-value ≈ 0, so H0 is strongly rejected.


s+ = 27, and since 27 is neither > 64 nor
< 14, we do not reject H0.


s+ = 22 < 24, so H0 is not rejected at the .05
level.


. Test H0: μD = 0 versus Ha: μD ≠ 0. s+ = 72
> 64, so H0 is rejected at level .05.


a. Test H0: μD = 0 versus Ha: μD < 0. s+ = 2
< 6, and so H0 is rejected at the .055
level.

b. Test H0: μD = 0 versus Ha: μD > 0.
Because s+ = 28 < 30, H0 cannot be
rejected at this level.


a. Assume that the population distribution
of differences is at least symmetric. With
s+ = 3 < 6, H0 is rejected. P-value =
.021.

b. Assume that the population distribution
of differences is normal. t = -2.54, df =
7, P-value = .019, so H0 is rejected.

c. The P-values were .035 (sign test), .021
(Wilcoxon signed-rank test), and .019
(paired t test). As is typical, the P-value
decreases with more powerful tests. But,
all three tests agree that H0 is rejected at
the .05 level, and the sign test has the
fewest assumptions.


(7.22, 7.73) 
(-.1745, -.0110)


With w = 38, reject Hp at the .05 level 
because the rejection region is {w > 36}. 


27. 


29. 


31. 


33. 


43. 


45. 


49. 




Test H0: μ1 - μ2 = 1 versus Ha: μ1 - μ2 > 1.
After subtracting 1 from the original process
measurements, we get w = 65. Do not reject
H0 because w < 84.


b. Test H0: μ1 - μ2 = 0 vs Ha: μ1 - μ2 < 0.
With a P-value of .0027 we reject H0 at
the .01 level.


Test H0: μ1 - μ2 = 0 vs Ha: μ1 - μ2 > 0.
W has mean m(m + n + 1)/2 = 59.5
and variance mn(m + n + 1)/12 = 89.25.
z = 2.33, P-value = .01, so H0 is rejected at
the .05 level.


Pain: z = -1.40, P-value = .0808. Depression:
z = -2.93, P-value = .0017. Anxiety:
z = -4.32, P-value < .0001. Fail to reject
first H0, reject last two. Chance of at least
one type I error is no more than .03.


(16, 87) 

h = 21.43, df = 3, P-value < .0001, so H0 is 
strongly rejected. 

h = 9.85, df = 2, P-value = .007 < .01, so H0 
is rejected. 


a. Rank averages of the three positions/rows 
are r̄1. = 12/6 = 2, r̄2. = 13/6 
= 2.16, r̄3. = 11/6 = 1.83; Fr = 0.333, 
df = 2, P-value ≈ .85, so H0 is certainly 
not rejected. 


b. Fr = 6.34, df = 2, P-value = .042, so 
reject H0 at the .05 level. 

c. Fr = 1.85, df = 2, P-value = .40, so H0 is 
not rejected. 
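
The value Fr = 0.333 in part (a) can be checked from the rank sums 12, 13, and 11 over 6 blocks that appear in the rank averages. The Python sketch below uses the usual rank-sum form of Friedman's statistic and the fact that the chi-squared upper-tail probability with 2 degrees of freedom is exp(-x/2). The function name is hypothetical, and the sketch assumes no tied ranks within a block.

```python
import math

def friedman_statistic(rank_sums, n_blocks):
    """Friedman's Fr from treatment rank sums (assumes no ties within blocks)."""
    k = len(rank_sums)  # number of treatments
    sum_sq = sum(r ** 2 for r in rank_sums)
    return 12 / (n_blocks * k * (k + 1)) * sum_sq - 3 * n_blocks * (k + 1)

# Rank sums taken from the rank averages 12/6, 13/6, 11/6 in part (a)
fr = friedman_statistic([12, 13, 11], n_blocks=6)

# With k = 3 treatments, df = 2, and the chi-squared upper tail is exp(-x/2)
p_value = math.exp(-fr / 2)

print(round(fr, 3), round(p_value, 2))
```

The closed-form tail probability only applies when df = 2; for other numbers of treatments a chi-squared table or statistical software would be needed.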


H0: μ1 = ··· = μ10 versus Ha: not all μi's 
are equal. Fr = 78.67, df = 9, P-value ≈ 0, 
so H0 is resoundingly rejected. The four 
algorithms inspired by quantum computing 
(Q's in the name) have much lower rank 
means, suggesting they are far better at 
minimizing entropy. 

Test H0: μ1 − μ2 = 0 vs Ha: μ1 − μ2 > 0. 
Rank sum for first sample = 26, 
P-value = .014. Reject H0 at .05 but not .01. 




51. 


53. 


55. 


57. 


mean = 637.5, variance = 10731.25 
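
One reading consistent with these two values is the null distribution of the Wilcoxon signed-rank statistic with n = 50. That identification is an assumption, since the exercise statement is not reproduced here. The sketch below, with a hypothetical function name, shows that n = 50 reproduces both numbers.

```python
def signed_rank_moments(n):
    """Null mean and variance of the Wilcoxon signed-rank statistic S+ (no ties)."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return mean, var

# Assumed sample size (inferred from the quoted mean and variance): n = 50
mean_s, var_s = signed_rank_moments(50)

print(mean_s, var_s)
```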


a. z = -0.21, P-value = .83. The data do not 
contradict a claim the sensory test is 
reliable. 

b. z = 1.70, P-value = .09. Reject H0 at .10 
level. This might indicate a lack of reliability 
of the sensory test for the population 
of healthy patients. 


Test H0: μ1 = ··· = μ5 versus Ha: not all 
μi's are equal. h = 20.21 > 13.277, so H0 is 
rejected at the .01 level. 


μi = population mean skin potential (mV) 
with the ith emotion (i = 1 for fear, etc.). 
The hypotheses are H0: μ1 = ··· = μ4 
versus Ha: not all μi's are equal. Fr = 6.45 < 
χ²_{.05,3} = 7.815, so we fail to reject H0 at the 
.05 level. 


Because w’ = 26 < 27, do not reject the null 
hypothesis at the 5% level. 
Chapter 15 


Ly 


. Normal, pu, = 


a. π(.50) = .80 and π(.75) = .20 
b. π(.50|HHHTH) = .6124 and 
π(.75|HHHTH) = .3876 
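
The posterior probabilities in part (b) follow from Bayes' theorem applied to the two-point prior in part (a). The Python sketch below reproduces them under the assumption that the parameter is the probability of heads (H) on each independent toss. The function name is hypothetical.

```python
def posterior(prior, tosses):
    """Posterior over candidate head probabilities after observing a toss string.

    prior:  dict mapping each candidate P(heads) to its prior probability
    tosses: string of 'H' and 'T' outcomes, assumed independent
    """
    heads = tosses.count("H")
    tails = tosses.count("T")
    # Unnormalized posterior weight = prior x likelihood
    weights = {p: pr * p ** heads * (1 - p) ** tails for p, pr in prior.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

post = posterior({0.50: 0.80, 0.75: 0.20}, "HHHTH")

print({p: round(v, 4) for p, v in post.items()})
```

Four heads in five tosses favor .75, so probability shifts toward it, but the strong prior weight of .80 keeps .50 the more probable value.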


a. Gamma(QQ, 5/3) 
b. Gamma(145, 5/53) 


α1 = α0 + Σxi, β1 = 1/(n + 1/β0) 


α1 = α0 + nr, β1 = β0 + Σxi − nr 


Tog + UInx; 


T =TH+Nn 
T +n , 


a. 13.68 


b. (11.54, 15.99) 


Normal, mean = 116.77, variance = 10.227, 
same as previous 


(.485, .535) 


ao + XX; 


a. ao Po zi n+1/Bo 


Overview and Descriptive Statistics 


Chambers, John, William Cleveland, Beat Kleiner, and 
Paul Tukey, Graphical Methods for Data Analysis, 
Brooks/Cole, Pacific Grove, CA, 1983. A highly 
recommended presentation of graphical and pictorial 
methodology in statistics. 

Cleveland, William S., The Elements of Graphing Data 
(2nd ed.), Hobart Press, Summit, NJ, 1994. A very 
nice survey of graphical methods for displaying and 
summarizing data. 

Freedman, David, Robert Pisani, and Roger Purves, 
Statistics (4th ed.), Norton, New York, 2007. An 
excellent, very nonmathematical survey of basic 
statistical reasoning and concepts. 

Hoaglin, David, Frederick Mosteller, and John Tukey, 
Understanding Robust and Exploratory Data Analy- 
sis, Wiley-Interscience, New York, 2000. Discusses 
why, as well as how, exploratory methods should be 
employed; it is good on details of stem-and-leaf 
displays and boxplots. 

Moore, David S. and William I. Notz, Statistics: 
Concepts and Controversies (9th ed.), Freeman, San 
Francisco, 2016. An extremely readable and enter- 
taining paperback that contains an intuitive discussion 
of problems connected with sampling and designed 
experiments. 

Peck, Roxy, et al. (eds.), Statistics: A Guide to the 
Unknown (4th ed.), Thomson-Brooks/Cole, Belmont, 
CA, 2005. Contains many short, nontechnical articles 
describing various applications of statistics. 

Utts, Jessica, Seeing Through Statistics (4th ed.), Cengage 
Learning, Boston, 2014. The focus is on statistical 
literacy and critical thinking; a wonderful exposition. 


Probability and Probability 
Distributions 


Balakrishnan, N., Norman L. Johnson, and Samuel Kotz, 
Continuous Univariate Distributions, Vol 1 (3rd ed.), 


Wiley, Hoboken, NJ, 2016. Encyclopedic, not for 
bedtime reading. 

Carlton, Matthew A. and Jay L. Devore, Probability with 
STEM Applications, Wiley, Hoboken, NJ, 2021. An 
expansion of the material in Chapters 2–6 of this 
book (Modern Mathematical Statistics with Applica- 
tions) plus additional material. 

Gorroochurn, Prakash, Classic Problems of Probability, 
Wiley, Hoboken, NJ, 2012. An entertaining excursion 
through 33 famous probability problems. 

Johnson, Norman L., Adrienne W. Kemp, and Samuel 
Kotz, Univariate Discrete Distributions (3rd ed.), 
Wiley, Hoboken, NJ, 2005. Encyclopedic, not for 
bedtime reading. 

Olofsson, Peter, Probabilities: The Little Numbers That 
Rule Our Lives (2nd ed.), Wiley, Hoboken, NJ, 2015. 
A very non-technical and thoroughly charming intro- 
duction to the quantitative assessment of uncertainty. 

Ross, Sheldon, A First Course in Probability (10th ed.), 
Prentice Hall, Upper Saddle River, NJ, 2018. Rather 
tightly written and more mathematically sophisticated 
than this text but contains a wealth of interesting 
examples and exercises. 

Ross, Sheldon, Introduction to Probability Models (12th 
ed.), Academic Press, Cambridge, MA, 2019. Another 
tightly written exposition of somewhat more 
advanced topics, again with a wealth of interesting 
examples and exercises. 

Winkler, Robert, Introduction to Bayesian Inference and 
Decision (2nd ed.), Probabilistic Publishing, Sugar 
Land, Texas, 2003. A very good introduction to 
subjective probability. 


Basic Inferential Statistics 


Casella, George and Roger L. Berger, Statistical Infer- 
ence (2nd ed.), Cengage Learning, Boston, 2001. The 
focus is on the theory of mathematical statistics; 
exposition is appropriate for advanced undergraduates 
and M.S. level students. Hopefully there will be a new 
edition soon. 
Daniel, Cuthbert and Fred S. Wood, Fitting Equations to 
Data: Computer Analysis of Multifactor Data (2nd 
ed.), Wiley-Interscience, New York, 1999. Contains 
many insights and methods that evolved from the 
authors’ extensive consulting experience. 

DeGroot, Morris, and Mark Schervish, Probability and 
Statistics (4th ed.), Pearson, Englewood Cliffs, NJ, 
2011. Includes an excellent discussion of both general 
properties and methods of point estimation; of 
particular interest are examples showing how general 
principles and methods can yield unsatisfactory 
estimators in particular situations. 

Davison, A.C. and D.V. Hinkley, Bootstrap Methods and 
Their Application, Cambridge University Press, Cam- 
bridge, UK, 1997. General principles and methods 
interspersed with many examples. 

Efron, Bradley, and Robert Tibshirani, An Introduction to 
the Bootstrap, Chapman and Hall, New York, 1993. 
The first general accessible exposition, and still 
informative and authoritative. 

Good, Philip, A Practitioner’s Guide to Resampling for 
Data Analysis, Data Mining, and Modeling, Chapman 
and Hall, New York, 2019. Obviously brand new, and 
hopefully informative about recent bootstrap 
methodology. 

Meeker, William Q., Gerald J. Hahn, and Luis A. 
Escobar, Statistical Intervals: A Guide for Practition- 
ers and Researchers (2nd ed.), Wiley, Hoboken, NJ, 
2016. Everything you ever wanted to know about 
statistical intervals (confidence, prediction, tolerance, 
and others). 

Rice, John, Mathematical Statistics and Data Analysis 
(3rd ed.), Cengage Learning, Boston, 2013. A nice 
blending of statistical theory and data analysis. 

Wilcox, Rand, Introduction to Robust Estimation and 
Hypothesis Testing (4th ed.), Academic Press, Cambridge, 
MA, 2016. Presents alternatives to the inferential 
methods based on t and F distributions. 


Specific Topics in Inferential Statistics 


Agresti, Alan, An Introduction to Categorical Data 
Analysis (3rd ed.), Wiley, New York, 2018. An 
excellent treatment of various aspects of categorical 
data analysis by one of the most prominent research- 
ers in this area. 

Chatterjee, Samprit and Ali Hadi, Regression Analysis by 
Example (5th ed.), Wiley, Hoboken, NJ, 2012. A 
relatively brief but informative discussion of selected 
topics. 

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. 
Dunson, Aki Vehtari, and Donald B. Rubin, Bayesian 
Data Analysis (3rd ed.), Chapman and Hall, New 
York, 2013. A comprehensive survey of theoretical, 
practical, and computational issues in Bayesian inference; 
its authors have made many contributions to 
Bayesian methodology. 

Hollander, Myles, and Douglas Wolfe, Nonparametric 
Statistical Methods (3rd ed.), Wiley, Hoboken, NJ, 
2013. A very good reference on distribution-free (i.e. 
non-parametric) methods with an excellent collection 
of tables. 
methodology. 

half of the book contains a well-presented survey of 
ANOVA. The level throughout is comparable to that 
of the present text (generally without proofs); the 
comprehensive discussion makes the book an excellent 
reference. 

to Statistical Methods and Data Analysis (7th ed.), 

chapters on ANOVA and regression methodology that 

mathematical exposition; there is a good chapter on 

analysis, including extensive information on the 


Statistical Computing and Simulation 


Datar, Radhika and Harish Garg, Hands On Exploratory 
Data Analysis with R, Packt Publishing, Birmingham, 
UK, 2019. Explains how the R software package can 
be used to explore various types of data. 

Law, Averill M., Simulation Modeling and Analysis (5th 
ed.), McGraw-Hill, New York, 2014. An authoritative 
survey of various aspects and methods of simulation. 


A 
Additive model, 674 
for ANOVA, 674-676 
for linear regression analysis, 712 
for multiple regression analysis, 
767 
Additive model equation, 704 
Adjusted coefficient of multiple 
determination, 772 
Adjusted R², 772 
Alternative hypothesis, 502 
Analysis of covariance, 790 
Analysis of variance (ANOVA), 639 
additive model for, 674-676, 687 
data transformation for, 667 
definition of, 639 
expected value in, 646, 662, 679, 
687 
fixed vs. random effects, 667 
Friedman test, 888 
fundamental identity of, 644, 
653, 677, 690, 722 
interaction model for, 687-695 
Kruskal-Wallis test, 887 
Levene test, 649-650 
linear regression and, 746, 749, 
764, 798, 805 
mean in, 640, 642, 643 
mixed effects model for, 682, 692 
multiple comparisons in, 
653-660, 666, 679-680, 691 
noncentrality parameter for, 663, 
671 
notation for, 642, 707 
power curves for, 663-664 
randomized block experiments 
and, 680-682 
regression identity of, 722-723 
sample sizes in, 663-664 
single-factor, 640-672 
two-factor, 672-695 
type I error in, 645-646 
type II error in, 662 
Anderson-Darling test, 835 


ANOVA table, 647 
Approximate 100(1 − α)% confidence 
interval for p, 476 
Ansari-Bradley test, 888 
Association, causation and, 301, 
754 
Asymptotic normal distribution, 372, 
436, 443, 445, 754 
Asymptotic relative efficiency, 867, 
876 
Autocorrelation coefficient, 757 
Average 
definition of, 26 
deviation, 33 
pairwise, 446, 868-869, 876 
rank, 887 
weighted, (see Weighted 
average) 


B 
Balanced study design, 642 
Bar graph, 9, 18 
Bartlett's test, 649 
Bayes estimator, 897 
Bayesian approach to inference, 855, 
896-897 
Bayes' Theorem, 80-83, 896-897 
Bernoulli distribution, 118, 139, 150, 
375, 441-443, 442, 898 
Bernoulli random variable, 112 
binomial random variable and, 
146, 376 
Cramér-Rao inequality for, 443 
definition of, 112 
expected value, 126 
Fisher information on, 438-439, 
442 
Laplace's rule of succession and, 
900 
mean of, 127 
mle for, 443 
moment generating function for, 
139, 140, 143 
pmf of, 118 
score function for, 440 
in Wilcoxon's signed-rank 
statistic, 314 
Beta distribution, 244-245, 897 
Beta functions, incomplete, 244 
Bias, 400, 487 
Bias-corrected and accelerated 
interval, 490, 617 
Bimodal histogram, 17 
Binomial distribution 
basics of, 144-151 
Bayesian approach to, 897-900 
multinomial distribution and, 286 
normal distribution and, 
223-224, 375 
Poisson distribution and, 
157-159 
Binomial experiment, 144-147, 150, 
158, 286, 375, 823 
Binomial random variable X, 146 
Bernoulli random variables and, 
150, 375 
cdf for, 148 
definition of, 146 
distribution of, 148 
expected value of, 150, 151 
in hypergeometric experiment, 
167 
in hypothesis testing, 504-505, 
526-529 
mean of, 150-151 
moment generating function for, 
151 
multinomial distribution of, 286 
in negative binomial experiment, 
168 
normal approximation of, 
223-224, 375 
pmf for, 148 
and Poisson distribution, 
157-159 
standard deviation of, 150 
unbiased estimation, 406, 434 
variance of, 150, 151 
Binomial theorem, 151, 168-170 
Bioequivalence tests, 637 
Birth process, pure, 445 
Bivariate, 2 
Bivariate data, 2, 706, 710, 720, 777, 
820 
Bivariate normal distribution, 
330-334, 550, 749-750 
Blocking, 680-682 
Bonferroni confidence intervals, 
499, 740-742, 776 
Bootstrap, 484 
Bootstrap distribution, 485 
Bootstrap procedure 
for confidence intervals, 
484-492, 617-619 
for paired data, 622-624 
Bootstrap P-value, 557 
Bound on the error of estimation, 
457 
Box-Muller transformation, 342 
Boxplot, 37-39 
comparative, 39-42 
Branching process, 352 
Bootstrap sample, 485 
Bootstrap standard error, 486 
Bootstrap t confidence interval, 489 


C 
Categorical characteristic, 1 
Categorical data 
classification of, 18 
graphs for, 18 
in multiple regression analysis, 
787-790 
Pareto diagram, 25 
sample proportion in, 18 
Cauchy distribution 
mean of, 384, 408 
median of, 408 
minimal sufficiency for, 432 
reciprocal property, 274 
standard normal distribution and, 
342 
uniform distribution and, 263 
variance of sample mean for, 414 
Causation, association and, 301, 754 
cdf. See Cumulative distribution 
function 
Cell counts/frequencies, 824-826, 
829-834, 840-847 
Cell probabilities, 828, 829, 834 
Censored experiments, 31, 409-410 
Censoring, 409 
Census, 1 
Central Limit Theorem (CLT) 
basics of, 371-377, 395-396 


Law of Large Numbers and, 376 
proof of, 395-396 
sample proportion distribution 
and, 224 
Wilcoxon rank-sum test and, 877 
Wilcoxon signed-rank test and, 
864 
Central t distribution, 383-384, 497 
Chebyshev's inequality, 137, 156, 
187, 228, 337, 380 
Chi-squared critical value, 481 
Chi-squared distribution 
censored experiment and, 496 
in confidence intervals, 457-458, 
482 
critical values for, 382, 458, 
481-482, 550, 825, 834-835 
definition of, 236 
degrees of freedom for, 380, 381 
exponential distribution and, 389 
F distribution and, 385-386 
gamma distribution and, 231, 
388 
in goodness-of-fit tests, 823-839 
Rayleigh distribution and, 263 
standard normal distribution and, 
263, 391-392, 395 
of sum of squares, 333, 643 
t distribution and, 383, 385 
in transformation, 259 
Weibull distribution and, 274 
Chi-squared random variable 
in ANOVA, 645 
cdf for, 380 
expected value of, 313 
in hypothesis testing, 565 
in likelihood ratio tests, 549, 553 
mean of, 388 
moment generating function of, 
380 
pdf of, 381 
standard normal random 
variables and, 381-382 
in Tukey's procedure, 659 
variance of, 390 
Chi-squared test 
degrees of freedom in, 825, 831, 
833, 841, 845 
for goodness of fit, 823-829 
for homogeneity, 841-843 
for independence, 844-846 
P-value for, 836-837 
for specified distribution, 
828-829 
z test and, 836 
Classes, 14 
Class intervals, 14-16, 346, 364, 
835, 837 
Coefficient of determination, 721 


Index 


definition of, 720-722 
F ratio and, 774 
in multiple regression, 772 
sample correlation coefficient 
and, 746 
Coefficient of (multiple) 
determination, 771 
Coefficient of skewness, 138, 144, 
211 
Coefficient of variation, 44, 272, 423 
Cohort, 351 
Combination, 70-72 
Comparative boxplot, 42-43, 588, 
589, 642 
Complement of an event, 52, 59 
Complete second-order model, 786 
Composite, 542 
Compound event, 51, 61 
Concentration parameter, 895 
Conceptual population, 6, 128, 359, 
569 
Conditional expectation, 320 
Conditional density, 319 
Conditional distribution, 317-327, 
428, 435, 749, 833, 889 
Conditional mean, 320-327 
Conditional probability, 75-83, 
87-88, 236-238, 428, 
431-432 
Conditional probability density 
function, 317 
Conditional probability mass 
function, 317 
Conditional variance, 320-322, 433 
Confidence bound, 460-461, 464, 
567, 575, 593 
Confidence interval 
adjustment of, 480 
in ANOVA, 654, 659-660, 666, 
678, 680, 692 
based on t distribution, 470-473, 
575-577, 592-594, 646, 
659-661, 730-733 
Bonferroni, 499, 740-742 
bootstrap procedure for, 
484-491, 622, 617-619, 624 
for a contrast, 660 
for a correlation coefficient, 753 
vs. credibility interval, 899-903 
definition of, 451 
derivation of, 457 
for difference of means, 
579-580, 581-583, 592-595, 
617-619, 632-633, 647-651, 
666, 679, 693 
for difference of proportions, 609 
distribution-free, 855-860 
for exponential distribution 
parameter, 458 




in linear regression, 704-707, 
739-741 
for mean, 452-456, 458, 
464-465, 484-488, 490-491 
for median, 419-421 
in multiple regression, 773, 822 
one-sided, 460, 567, 593 
for paired data, 593-595, 623 
for ratio of variances, 615-616, 
621 
sample size and, 456 
Scheffé method for, 702 
sign, 860 
for slope coefficient, 729 
for standard deviation, 481-482 
for variance, 481-482 
width of, 453, 456-457, 467, 
478, 490, 568 
Wilcoxon rank-sum, 873-875 
Wilcoxon signed-rank, 867-869 
Confidence level 
definition of, 451, 454-456 
simultaneous, 654-659, 666, 
672, 741 
in Tukey's procedure, 654-659, 
666, 679, 680 
Confidence set, 867 
Conjugate prior, 893 
Consistent, 408 
Consistency, 377, 424, 443-444 
Consistent estimator, 377, 424, 
443-444 
Contingency tables, two-way, 
840-848 
Continuity correction, 223-224 
Continuous random variable(s) 
conditional pdf for, 318, 903 
cumulative distribution function 
of, 195-200 
definition of, 114, 190 
vs. discrete random variable, 192 
expected value of, 203-204 
joint pdf of (see Joint probability 
density functions) 
marginal pdf of, 281-283 
mean of, 203, 204 
moment generating of, 208-210 
pdf of (see Probability density 
function) 
percentiles of, 198-200 
standard deviation of, 205-207 
transformation of, 258-262, 
336-341 
variance of, 205-207 
Contrast, 660 
Contrast of means, 659-660 
Convenience samples, 6 
Convergence 
in distribution, 164, 258 


in mean square, 377 
in probability, 377 
Convex function, 275 
Convolution, 307 
Correction factor, 167 
Correlation coefficient, 299 
autocorrelation coefficient and, 
757 
in bivariate normal distribution, 
330-334, 749 
confidence interval for, 753 
covariance and, 298 
Cramér-Rao inequality and, 
441-442 
definition of, 299, 746 
estimator for, 749 
Fisher transformation, 751 
for independent random 
variables, 299 
in linear regression, 746, 750, 
751 
measurement error and, 355 
paired data and, 596-597 
sample (see Sample correlation 
coefficient) 
Covariance, 296 
correlation coefficient and, 299 
Cramér-Rao inequality and, 
441-442 
definition of, 296 
of independent random variables, 
300-301 
of linear functions, 298 
matrix format for, 799 
Covariance matrix, 799 
Covariate, 790 
Cramér-Rao inequality, 441-442 
Credibility interval, 899 
Critical values 
chi-squared, 381 
F, 386 
standard normal (z), 217 
studentized range, 654 
t, 384, 481 
tolerance, 469 
Wilcoxon rank-sum interval, 
876, 925 
Wilcoxon rank-sum test, 
871-879, 888, 924 
Wilcoxon signed-rank interval, 
867, 923 
Wilcoxon signed-rank test, 864, 
865, 886 
Cumulative distribution function, 
119, 195 
Cumulative distribution function for 
a continuous random 
variable, 195 
for a discrete random variable, 
119 
joint, 352 
of order statistics, 343, 344 
pdf and, 194 
percentiles and, 199 
pmf and, 119, 121, 122 
transformation and, 258 
Cumulative frequency, 25 
Cumulative relative frequency, 25 


D 
Danger of extrapolation, 716 
Data, 1 
bivariate, 2, 706, 720, 783 
categorical (see Categorical data) 
censoring of, 31, 409-410 
characteristics of, 1 
collection of, 5-7 
definition of, 1 
multivariate, 2, 19 
qualitative, 18 
univariate, 2 
Deductive reasoning, 4 
Degrees of freedom (df), 34 
in ANOVA, 644-647, 676, 688 
for chi-squared distribution, 
380-382 
in chi-squared tests, 825, 831, 
833, 842 
for F distribution, 385 
in regression, 718, 771 
sample variance and, 35 
for Studentized range 
distribution, 654 
for t distribution, 383, 458, 575, 
581 
type II error and, 663 
Delta method, 207 
De Morgan’s laws, 55 
Density, 16 
conditional, 317-319 
curve, 191 
function (pdf), 191 
joint, 279 
marginal, 281 
scale, 17 
Density curve, 191 
Density scale, 16 
Dependence, 87-91, 283-288, 300, 
319, 844 
Dependent, 88, 284, 704 
Dependent events, 87-91 
Dependent variable, 704 
Descriptive statistics, 1-39 
Design matrix, 796 
Deviations from the mean, 33 
Dichotomous trials, 144 
Difference statistic, 411 
Discrete, 113 
Discrete random variable(s) 
conditional pmf for, 317 
cumulative distribution function 
of, 119-122 
definition of, 113 
expected value of, 126 
joint pmf of (see Joint probability 
mass function) 
marginal pmf of, 279 
mean of, 127 
moment generating of, 139 
pmf of (see Probability mass 
function) 
standard deviation of, 132 
transformation of, 261 
variance of, 132 
Disjoint events, 53 
Dotplots, 11 
Double-blind experiment, 605 
Dummy variable, 787 
Dunnett’s method, 660 


E 
Effect, 662 
Efficiency, 442 
Efficiency, asymptotic relative, 867, 
876 
Efficient estimator, 442 
Empirical rule, 221 
Erlang distribution, 238, 272 
Error(s) 
estimated standard, 400, 731, 801 
estimation, 402 
family vs. individual, 659 
measurement, 213, 249, 406, 550 
prediction, 468, 741, 768 
rounding, 35 
standard, 400, 801 
type I, 504 
type II, 504 
Error Sum of Squares (SSE), 643, 
718 
Estimated regression function, 760, 
768, 772 
Estimated regression line, 713, 714 
Estimated standard error, 99, 400, 
731, 801 
Estimator, 398, 582 
Event(s), 51 
complement of, 52 
compound, 51, 61 
definition of, 51 
dependent, 87-91 
disjoint, 53 
exhaustive, 80 
independent, 87-91 


indicator function for, 430 
intersection of, 52 
mutually exclusive, 53 
mutually independent, 90 
simple, 51 
union of, 52 
Venn diagrams for, 53 
Expected counts, 824 
Expected mean squares 
in ANOVA, 662, 666, 690, 704 
F test and, 679, 682, 690, 693 
in mixed effects model, 682, 692 
in random effects model, 668, 
682-683 
in regression, 765 
Expected or mean value, 203 
Expected value, 127 
conditional, 320 
of a continuous random variable, 
203 
covariance and, 296 
of a discrete random variable, 
126 
of a function, 129, 294 
heavy-tailed distribution and, 
129, 135 
of jointly distributed random 
variables, 294 
Law of Large Numbers and, 376 
of a linear combination, 303 
of mean squares (see Expected 
mean squares) 
moment generating function and, 
139, 208 
moments and, 137 
in order statistics, 342-343, 348 
of sample mean, 346, 368 
of sample standard deviation, 
405, 446 
of sample total, 368 
of sample variance, 405 
Experiment, 49 
binomial, 144, 285, 823 
definition of, 50 
double-blind, 605 
observational studies in, 571 
paired data, 596 
paired vs. independent samples, 
603 
randomized block, 680-682 
randomized controlled, 571 
repeated measures designs, 681 
simulation, 363-366 
Explanatory variable, 704 
Exponential distribution, 234 
censored experiments and, 409 
chi-squared distribution and, 381 
confidence interval for 
parameter, 457 
double, 550 
estimators for parameter, 409, 
416 
goodness-of-fit test for, 835 
mixed, 272 
in pure birth process, 445 
shifted, 427, 552 
skew in, 346 
standard gamma distribution and, 
234 
Weibull distribution and, 239 
Exponential random variable(s) 
Box-Muller transformation and, 
342 
cdf of, 235 
expected value of, 234 
independence of, 288 
mean of, 234 
in order statistics, 343, 347 
pdf of, 234 
transformation of, 258, 338, 340 
variance of, 234 
Exponential regression model, 820 
Exponential smoothing, 48 
Extreme outliers, 36-39 
Extreme value distribution, 253 


F 
Factorial notation, 69 
Factorization theorem, 429 
Factors, 639 
Failure rate function, 274 
Family of probability distributions, 
118, 250 
F distribution 
chi-squared distribution and, 385 
definition of, 385 
expected value of, 387 
for model utility test, 735, 772, 
799 
noncentral, 663 
pdf of, 386 
Finite population correction factor, 
167 
First quartile, 36 
Fisher information, 436 
Fisher information I(θ), 438 
Fisher-Irwin test, 607 
Fisher transformation, 751 
Fitted (or predicted) values, 678, 
688, 717, 719, 760, 770 
Fixed effects, 667 
Fixed effects model, 667, 682, 693 
Fourth spread, 36, 357 
Frequency, 12 
Frequency distribution, 12 
Friedman’s test, 882, 886, 888 
F test 
in ANOVA, 647, 683, 690 
Bartlett's test and, 649 
coefficient of determination and, 
772 
critical values for, 386, 612, 646 
distribution and, 385, 612, 646 
for equality of variances, 612, 
621 
expected mean squares and, 663, 
679, 682, 690, 693 
Levene test and, 649 
power curves and, 509, 521 
P-value for, 613, 614, 622, 647 
in regression, 772 
sample sizes for, 663 
single-factor, 670 
vs. t test, 664 
two-factor, 682 
type II error in, 663 


G 

Galton, 333-334, 724, 749 
Galton-Watson branching process, 
352 
Gamma distribution, 231 
chi-squared distribution and, 380 
definition of, 231 
density function for, 232 
Erlang distribution and, 238 
estimation of parameters, 416, 
421, 424 
exponential distribution and, 
234-236 
Poisson distribution and, 901 
standard, 231 
Weibull distribution and, 239 
Gamma function, 231 
incomplete, 233, 253 
properties of, 231 
Gamma random variables, 232 
Geometric distribution, 169, 188 
Geometric random variables, 169 
Global F test, 774 
Goodness-of-fit test 
for composite hypotheses, 829 
definition of, 823 
for homogeneity, 841-843 
for independence, 844-847 
simple, 823 
Gosset, 654 
Grand mean, 642 


H 
Half-normal plot, 258 
Hat matrix, 798 
Histogram 

bimodal, 17 


class intervals in, 14-16 
construction of, 2 
density, 16, 17, 928 
multimodal, 17 
Pareto diagram, 25 
for pmf, 117 
symmetric, 17 
unimodal, 17 
Hodges-Lehmann estimator, 446 
Homogeneity, 841-843 
Honestly Significant Difference 
(HSD), 656 
Hyperexponential distribution, 272 
Hypergeometric distribution, 
165-167 
and binomial distribution, 167 
Hypergeometric random variable, 
165-167 
Hyperparameters, 894 
Hypothesis 
alternative, 502 
composite, 829-836 
definition of, 502 
errors in testing of, 504-509 
notation for, 501 
null, 502 
research, 502 
simple, 542 
Hypothetical population, 5 


I 
Ideal power function, 546 
Inclusion-exclusion principle, 61 
Inclusive inequalities, 152 
Incomplete beta function, 244 
Incomplete gamma function, 233, 
253 
Independence 
chi-squared test for, 844 
conditional distribution and, 319 
correlation coefficient and, 300 
covariance and, 299, 302 
of events, 87-90 
of jointly distributed random 
variables, 283-284, 287 
in linear combinations, 303-304 
mutual, 90 
pairwise, 92, 107 
in simple random sample, 359 
Independent, 88, 284, 287, 704 
Independent and identically 
distributed (iid), 359 
Independent variable, 704 
Indicator (or dummy) variable, 787 
Indicator function, 430 
Inductive reasoning, 4 
Inferential statistics, 4-5 
Inflection point, 213 
Intensity function, 187 
Interaction, 675, 687, 689-691, 
692-695, 785-790 

Interaction effect, 786 
Interaction parameters, 687 
Interaction plots, 675 
Interaction sum of squares, 690 
Interaction term, 786 
Intercept, 250, 706, 715 
Intercept coefficient, 706 
Interpolation, 717 
Interquartile range (iqr), 36 
Intersection of events 
definition of, 52 
multiplication rule for probability 
of, 78-80, 88 
Invariance principle, 423 
Inverse cdf method, 175 


J 
Jacobian, 337, 340 
Jensen's inequality, 275 
Joint cumulative distribution 
function, 352 
Jointly distributed random variables 
bivariate normal distribution of, 
330-334 
conditional distribution of, 
317-326 
correlation coefficients for, 299 
covariance between, 296 
expected value of function of, 
294 
independence of, 283-284 
linear combination of, 303-309 
in order statistics, 346-348 
pdf of (see Joint probability 
density functions) 
pmf of (see Joint probability 
mass functions) 
transformation of, 336-341 
variance of function of, 302, 305 
Jointly sufficient statistics, 431 
Joint marginal density function, 293 
Joint pdf, 285 
Joint pmf, 285 
Joint probability density function, 
279 
Joint probability mass function, 
277-279 
Joint probability table, 278 


K 

k-out-of-n system, 153 
Kruskal-Wallis test, 880-881 
Kth central moment, 137 
Kth moment, 137 
Kth moment about the mean, 137 
Kth moment of the distribution, 416 
Kth population moment, 416 
Kth sample moment, 416 
k-tuple, 67-68 


L 
Lag 1 autocorrelation coefficient, 
757 
Laplace distribution, 550 
Laplace's rule of succession, 900 
Largest extreme value distribution, 
ee 9 | 
Law of Large Numbers, 376-377, 
384-385, 443 
Law of total probability, 80 
Least squares estimates, 714, 731, 
763, 768-769 
Least Squares Regression Line 
(LSRL), 714 
Level α test, 508 
Level of a factor, 639, 673, 682 
Levels, 639 
Levene test, 649-650 
Leverages, 802 
Likelihood function, 420, 543, 547 
Likelihood ratio 
chi-squared statistic for, 550 
definition of, 543 
mle and, 548 
model utility test and, 820 
in Neyman-Pearson theorem, 
543 
significance level and, 543, 544 
sufficiency and, 447 
tests, 548 
Likelihood ratio test, 548 
Likelihood ratio test statistic, 548 
Limiting relative frequency, 57, 58 
Linear combination, 303 
distribution of, 310 
expected value of, 303 
independence in, 303 
variance of, 304 
Linear probabilistic model, 703, 715 
Linear regression 
additive model for, 704, 705, 767 
ANOVA in, 734, 790 
confidence intervals in, 730, 739 
correlation coefficient in, 
745-754 
definition of, 706 
degrees of freedom in, 718, 771, 
798 
least squares estimates in, 
713-723, 764 
likelihood ratio test in, 820 
mles in, 718 


model utility test in, 648, 772, 
798 
parameters in, 706, 713-723, 767 
percentage of explained variation 
in, 720-721 
prediction interval in, 737, 741, 
775 
residuals in, 717, 758, 771 
summary statistics in, 715 
sums of squares in, 718-723, 771 
t ratio in, 732, 751 
Line graph, 116-117 
Location parameter, 253 
Logistic distribution, 349 
Logistic regression model 
contingency tables for, 847-848 
definition of, 807-808 
fit of, 808-809 
mles in, 809 
in multiple regression analysis, 
790 
Logit function, 807, 808 
Log-likelihood function, 420 
Lognormal distribution, 242-244, 
376 
Lognormal random variables, 
242-243 
Long-run (or limiting) relative 
frequency, 58 
Lower confidence bound, 460 
Lower confidence bound for p, 464 
Lower quartile, 36 


M 

Main effects for factor A, 687 

Main effects for factor B, 687 

Mann—Whitney test, 871-878 

Marginal distribution, 279, 281, 317 

Marginal probability density 

functions, 281 

Marginal probability mass functions, 

279 

Margin of error, 457 

Matrices in regression analysis, 

795-804 

Maximum a posteriori (MAP), 897 

Maximum likelihood estimator, 420 

for Bernoulli parameter, 443 

for binomial parameter, 443 

Cramér-Rao inequality and,
442 

data sufficiency for, 434 

Fisher information and, 436 

for geometric distribution 
parameter, 631 

in goodness-of-fit testing, 830 

in homogeneity test, 841 

in independence test, 844 
in likelihood ratio tests, 547
in linear regression, 720, 726 
in logistic regression, 808 
sample size and, 424 
score function and, 443 
McNemar's test, 611, 636 
Mean 
conditional, 320-321 
correction for the, 644 
deviations from the, 33, 244, 
650, 718, 830 
of a function, 129, 294 
vs. median, 28 
moments about, 137 
outliers and, 27, 28 
population, 27 
regression to the, 334, 749 
sample, 26 
Mean square 
expected, 663, 679, 682, 683, 
693 
lack of fit, 766 
pure error, 766 
Mean square error 
definition of, 403 
of an estimator, 403 
MVUE and, 407 
sample size and, 406 
Mean Square for Error (MSE), 403, 
645 
Mean Square for Treatments 
(MSTr), 645 
Mean-square value, 133 
Mean value, 127 
Mean vector, 799 
Measurement error, 406 
Median, 198 
in boxplot, 37-38 
of a distribution, 27-28
as estimator, 444 
vs. mean, 28 
outliers and, 27, 28 
population, 28 
sample, 26 
statistic, 342 
Memoryless property, 236 
Mendel's law of inheritance, 826, 
827 
M-estimator, 425, 448 
Method of Moments Estimators 
(MMEs), 416 
Midfourth, 47 
Midrange, 399 
Mild outlier, 38 
Minimal sufficient statistic, 432, 434 
Minimize absolute deviations 
principle, 763 
Minimum variance unbiased 
estimator, 407 




Mixed effects model, 682 
Mixed exponential distribution, 283 
mle. See Maximum likelihood 
estimate 
Mode, 46 
of a continuous distribution, 271 
of a data set, 46 
of a discrete distribution, 186 
Model equation, 662 
Model utility test, 732 
Moment generating function, 139, 
208 
of a Bernoulli rv, 139 
of a binomial rv, 151 
of a chi-squared rv, 316 
of a continuous rv, 208 
definition of, 139, 208 
of a discrete rv, 139 
of an exponential rv, 258 
of a gamma rv, 232 
of a linear combination, 309 
and moments, 141, 208 
of a negative binomial rv, 169 
of a normal rv, 221 
of a Poisson rv, 160 
of a sample mean, 395 
uniqueness property of, 140, 208 
Moments 
definition of, 137 
method of, 397, 416, 418, 424 
and moment generating function, 
208 
Monotonic, 259 
Multimodal histogram, 17 
Multinomial distribution, 286 
Multinomial experiment, 286, 823, 
824 
Multiple comparisons procedure, 
653 
Multiple logit function, 812 
Multiple regression 
additive model, 767, 795 
categorical variables in, 787, 790 
coefficient of multiple 
determination, 772 
confidence intervals in, 776 
covariance matrices in, 799 
degrees of freedom in, 771 
diagnostic plots, 777 
fitted values in, 769 
F ratio in, 774 
interaction in models for, 785, 
788, 790 
leverages in, 802, 803 
model utility test in, 772 
normal equations in, 768, 796 
parameters for, 767 
and polynomial regression, 783 
prediction interval in, 776 


principle of least squares in, 768, 
796 
residuals in, 775, 777 
squared multiple correlation in, 
FeAl 
sums of squares in, 771 
Multiplication rule, 78, 79, 82, 88, 
101 
Multiplicative exponential 
regression model, 820 
Multiplicative power regression 
model, 820 
Multivariate, 2 
Multivariate hypergeometric 
distribution, 291 
Mutually exclusive events, 53, 92 
Mutually independent events, 90 
MVUE. See Minimum variance 
unbiased estimator 


N 

Negative binomial distribution, 

168-170 

Negative binomial random variable, 
168 

estimation of parameters, 417 

Negatively skewed, 17 

Newton’s binomial theorem, 169 

Neyman factorization theorem, 429 

Neyman-Pearson theorem, 543-545, 

547 

Noncentrality parameter, 497, 663, 

671 

Noncentral F distribution, 663 

Noncentral t distribution, 497

Nonhomogeneous Poisson Process, 

187 

Nonparametric methods, 855-888 

Nonstandard normal distribution, 

218 

Normal distribution, 213 

asymptotic, 372, 436, 443 

binomial distribution and, 


223-224, 375 
bivariate, 330-333, 389, 550, 
754 


confidence interval for mean of, 
452-454, 456, 460, 463 

continuity correction and, 
223-224 

density curves for, 213 

and discrete random variables, 
222, 223 

of linear combination, 389 

lognormal distribution and, 242, 
376 

nonstandard, 218 

pdf for, 212 
percentiles for, 215-217, 220,
221, 248 
probability plot, 247 
standard, 214 
t distribution and, 383-386 
z table, 214-216 
Normal equations, 714, 768, 805 
Normal probability plot, 247 
Normal random variable, 214 
Null distribution, 517, 862 
Null hypothesis, 502 
Null hypothesis of homogeneity, 841 
Null hypothesis of independence, 
845 
Null set, 53, 56 
Null value, 503, 512 


O 
Observational study, 571 
Observed counts, 824 
Odds, 808 
Odds ratio, 808, 847-848 
One-sample t CI, 464
One-sided confidence interval, 460 
One-way ANOVA, 639 
Operating characteristic curve, 154 
Ordered categories, 847-848 
Ordered pairs, 66-67 
Order statistics, 342-347, 402, 
431-432, 552, 856 
sufficiency and, 431-432 
Outliers, 37 
in a boxplot, 37-39 
extreme, 37 
leverage and, 802 
mean and, 27, 29, 490-491 
median and, 27, 29, 37, 490, 491 
mild, 37 
in regression analysis, 763 


P 
Paired data 
in before/after experiments, 594, 
611 
bootstrap procedure for, 622-624 
confidence interval for, 592-594 
definition of, 591 
vs. independent samples, 597 
in McNemar’s test, 611 
permutation test for, 624 
t test for, 594-596 
in Wilcoxon signed-rank test, 
864-866 
Pairwise average, 868, 869, 876 
Pairwise independence, 107 
Parallel connection, 54, 90, 92, 93, 
343, 344 
Parameter(s), 118
confidence interval for, 457, 458 
estimator for a, 397-410 
Fisher information on, 436-444 
goodness-of-fit tests for, 
827-829, 830-833 
hypothesis testing for, 503, 526 
location, 253, 432 
maximum likelihood estimate of, 
420-425, 434 
moment estimators for, 416-418 
MVUE of, 407-409, 424, 434, 
442 
noncentrality, 663 
null value of, 503 
of a probability distribution, 
118-119 
in regression, 706-707, 713-723, 
741, 749, 767, 808 
scale, 232, 240, 253-255, 431 
shape, 254-255, 431-434 
sufficient estimation of, 428 
Parameter space, 547 
Pareto diagram, 25 
Pareto distribution, 201, 211, 262 
Partial F test, 793 
pdf. See Probability density function 
Pearson’s chi-squared, 825 
Percentiles 
for continuous random variables, 
198-200 
in hypothesis testing, 534 
in probability plots, 247-253 
sample, 29, 248, 253 
of standard normal distribution, 
215-217, 247-253 
Permutation, 68 
Permutation test, 619-625 
PERT analysis, 245 
Pie chart, 18 
Pivotal quantity, 457 
Plot 
probability, 247-256, 435, 577, 
750, 777
scatter, 704-706, 720-721, 746,
750 
pmf. See Probability mass function 
Point estimate/estimator, 398 
biased, 406-408 
bias of, 403-406 
bootstrap techniques for, 410, 
484-492 
bound on the error of estimation, 
457 
censoring and, 409-410 
consistency, 424, 443-444 
for correlation coefficient, 
749-750 


and Cramér-Rao inequality,
441-444 
definition of, 27, 359, 398 
efficiency of, 442 
Fisher information on, 436-444 
least squares, 714-718 
maximum likelihood (mle), 
418-425 
of a mean, 27, 359, 398, 407 
mean squared error of, 403 
moments method, 416-418, 424 
MVUE of, 407, 424, 434, 442 
notation for, 397, 399 
of a standard deviation, 399,
405 
standard error of, 410 
of a variance, 399, 405 
Point prediction, 468, 716 
Poisson distribution, 156 
Erlang distribution and, 238 
expected value, 160, 164 
exponential distribution and, 235 
gamma distribution and, 896 
goodness-of-fit tests for, 
833-835 
in hypothesis testing, 542-544 
mode of, 186 
moment generating function for, 
160 
parameter of, 160 
and Poisson process, 160-161, 
235 
variance, 160, 164 
Poisson process, 160-161 
nonhomogeneous, 187 
Polynomial regression model, 
783-784 
Pooled, 582 
Pooled t procedures
and ANOVA, 549, 581-582, 664 
vs. Wilcoxon rank-sum 
procedures, 875 
Population, 1 
Population mean, 27 
Population median, 28 
Population (or true) regression line, 
707 
Population standard deviation, 34 
Positively skewed, 17 
Posterior distribution, 890 
Posterior probability, 80-83, 899, 
900 
Power, 509 
Power curves, 509, 663-664 
Power function, 545 
Power function of a test, 545-547, 
663-664 
Power model for regression, 820 




Power of a test 
Neyman-Pearson theorem and, 
545-547 
type II error and, 522, 544-548, 
582 
Precision, 400, 451, 456, 468, 478, 
594, 682, 895 
Predicted values, 678, 717, 770 
Prediction interval, 469, 741 
Bonferroni, 742 
vs. confidence interval, 469, 
741-742, 776 
in linear regression, 738, 
741-742 
in multiple regression, 775 
for normal distribution, 467-469 
Prediction level, 469, 742, 776 
Predictor, 704 
Predictor variable, 704, 767, 
783-786 
Principle of least squares, 713-723, 
763, 768, 777 
Prior distribution, 889 
Prior probability, 80, 889 
Probability, 49 
conditional, 75-83, 86-88, 236, 
317-319, 428, 431-432 
continuous random variables 
and, 114, 189-262, 279-288, 
317-319 
counting techniques for, 66-72
definition of, 49 
density function (see Probability 
density function) 
of equally likely outcomes, 
61-62 
histogram, 118, 190-191, 
222-224, 361-362 
inferential statistics and, 4, 9, 357 
Law of Large Numbers and, 
376-377, 384-385 
law of total, 80 
mass function (see Probability 
mass function) 
of null event, 56 
plots, 247-256, 435, 569, 750, 
760, 777 
posterior/prior, 80-82, 889, 897, 
900
properties of, 55-62 
relative frequency and, 57-58, 
363-364 
sample space and, 49-53, 55-56,
62, 66, 109 
and Venn diagrams, 53, 60, 76 
Probability density function (pdf), 
191 
conditional, 318-320 
definition of, 191

joint, 277-348, 383, 420, 
429-431, 434, 542, 547 

marginal, 281-283, 339-340 

vs. pmf, 192 


Probability distribution, 116, 191 


Bernoulli, 112, 116-119, 128, 
139-140, 143, 150, 375, 441, 
442, 443, 892 

beta, 244-245 

binomial, 144-151, 157-159, 
223-224, 375, 417-418, 
475-476, 503-507 

bivariate normal, 330-334, 550, 
754

Cauchy, 263, 274, 342 

chi-squared, 261, 380-382, 408 

conditional, 317-326 

continuous, 114, 189-276 

discrete, 111-188 

exponential, 234-236, 239, 409 

extreme value, 253-254 

F, 385-386 

family, 118, 250, 253-256, 646 

gamma, 230-237, 253-254 

geometric, 121-122, 129, 169, 
261 

hyperexponential, 272 

hypergeometric, 165-167, 
305-306 

joint, 277-356, 749-750, 830 

Laplace, 317, 550-551 

of a linear combination, 
303-311, 331 

logistic, 349 

lognormal, 242-244, 376 

multinomial, 286, 823 

negative binomial, 168-170 

normal, 213-225, 242-253, 
330-334, 368-376, 388, 828 

parameter of a, 118-119 

Pareto, 201, 211, 262 

Poisson, 156-161, 235 

Rayleigh, 200, 263, 414, 427 

of a sample mean, 357-366, 
368-377 

standard normal, 214-217

of a statistic, 357-377 

Studentized range, 654 

symmetric, 17, 29, 138, 200, 
203, 213 

t, 383-385, 386, 536, 594 

uniform, 192-193, 195 

Weibull, 239-241 


Probability generating function, 187 
Probability histogram, 118 
Probability mass function, 116 


conditional, 317-318 
definition of, 116-123 


joint, 277-281 
marginal, 279 
Probability of the event, 55 
Probability plot, 247 
Product rules, 66-67 
Proportion 
population, 475, 526-529, 
602-607 
sample, 225, 375, 413, 602, 
845 


trimming, 29, 399, 406, 408-409 


Pure birth process, 445 
P-value, 532 
for chi-squared test, 826-827 
definition of, 532 
for F tests, 613-615 
for t tests, 536-539
type I error and, 534 
for z tests, 534-536 


Q 

Quadratic regression model, 783 

Qualitative data, 18 

Quantile, 198, 225, 237, 365, 382, 
387, 458, 490, 618, 829, 
855-857, 899 

Quantitative characteristic, 1 

Quartiles, 36 


R 

Random effects, 667 

Random effects model, 667-668, 
682-683, 692-695 

Random interval, 453-455 

Random number generator, 6, 74, 
95, 174, 265, 854 

Randomized block experiment, 
680-682 


Randomized controlled experiment, 


eyrg | 
Randomized response technique, 
415 
Random sample, 343, 359 
Random variable, 111, 112 
continuous, 189-275 
definition of, 112 
discrete, 111-188 
jointly distributed, 277-356 
standardizing of, 218 
types of, 113 
Range, 32 
definition of, 32 
in order statistics, 342-345 
population, 458 
sample, 32, 342-345 
Studentized, 654-655 
Rank average, 886 
Rao-Blackwell theorem, 433-434,
443, 444 
Ratio statistic, 550 
Rayleigh distribution, 263, 414, 427 
Regression 
coefficient, 727-735, 767-770, 
795-797, 799-800 
effect, 334, 749 
function, 704, 760, 766, 771, 
783, 788 
line, 707-709, 713-723, 
727-732 
linear, 706-709, 713-723, 
727-734, 737-742 
logistic, 806-809 
matrices for, 795-804 
to the mean, 334 
multiple, 767-776 
multiplicative exponential model, 
820 
multiplicative power model for, 
820 
plots for, 760-764 
polynomial, 783-784 
quadratic, 783-784 
through the origin, 448-496 
Regression effect, 749 
Regression sum of squares, 723 
Regression to the mean, 749 
Rejection region, 504 
cutoff value for, 504-508 
definition of, 504 
lower-tailed, 505, 513-514 
in Neyman-Pearson theorem,
543-547 
two-tailed, 513 
type I error and, 504 
in union-intersection test, 637 
upper-tailed, 506, 513-514 
Relative frequency, 12-18, 57-58 
Repeated-measures, 681 
Repeated measures designs, 681 
Replications, 57, 363-365, 455, 672 
Resample, 485 
Research hypothesis, 502 
Residual plots, 678, 688, 760-763 
Residuals 
in ANOVA, 648, 677, 691, 717 
definition of, 643 
leverages and, 801-802 
in linear regression, 717, 
758-762 
in multiple regression, 771 
standard error, 757 
standardizing of, 758, 777 
variance of, 758, 801 
Residual standard deviation, 718, 
Fil 
Residual sum of squares, 718 
Residual vector, 798
Response, 704 

Response variable, 6, 704, 807 
Response vector, 796 

Robust estimator, 409 
Ryan-Joiner test, 256 


S
Sample, 1
convenience, 6 
definition of, 1 
outliers in, 37-38
simple random, 6, 359 
size of (see Sample size) 
stratified, 6 
Sample coefficient of variation, 44 
Sample correlation coefficient, 746 
in linear regression, 745-746, 
751 
vs. population correlation 
coefficient, 749, 751-754 
properties of, 746-747 
strength of relationship, 747 
Sample mean, 26 
definition of, 26 
population mean and, 368-377 
sampling distribution of, 
368-377 
Sample median, 27 
definition of, 27 
in order statistics, 342-343 
vs. population median, 491 
Sample moments, 416 
Sample percentiles, 248 
Sample proportion, 225 
Sample size, 9 
in ANOVA, 663-664 
asymptotic relative efficiency 
and, 867-875 
Central Limit Theorem and, 
375 
confidence intervals and, 
456-457, 458, 464 
definition of, 9 
in finite population correction 
factor, 167 
for F test, 663-664 
for Levene test, 649-650 
mle and, 424, 443 
noncentrality parameter and, 
663-664, 671 
Poisson distribution and, 156 
for population proportion, 
475-478 
probability plots and, 252 
in simple random sample, 359 
type I error and, 508, 516, 517, 
570, 605 


type II error and, 508, 516, 517, 
527, 569, 582
variance and, 377 
z test and, 515-517, 527-528 
Sample space, 50 
definition of, 49 
determination, 457, 466-467, 
516, 522, 528, 570-571, 605 
probability of, 55-62 
Venn diagrams for, 53 
Sample standard deviation, 33 
in bootstrap procedure, 487 
confidence bounds and, 460 
confidence intervals and, 464 
definition of, 33 
as estimator, 406, 446 
expected value of, 405, 446 
independence of, 389, 390 
mle and, 423 
population standard deviation 
and, 359, 405, 446 
sample mean and, 33, 389 
sampling distribution of, 360, 
361, 383, 406, 446 
variance of, 561 
Sample total, 368 
Sample variance, 33 
in ANOVA, 642 
calculation of, 35 
definition of, 33 
distribution of, 359-362, 383 
expected value of, 405 
population variance and, 34, 
388-389, 402 
Sampling distribution, 359 
bootstrap procedure and, 485, 
617, 855
definition of, 357, 359 
derivation of, 360-363 
of intercept coefficient, 818 
of mean, 360-362, 481-484 
permutation tests and, 855 
simulation experiments for, 
363-366 
of slope coefficient, 727-734 
Scale parameter, 231, 239-240, 
253-254, 431 
Scatter plot, 704-705 
Scheffé method, 702
Score function, 439-441 
Segmented bar chart, 843 
Series connection, 343-344 
Set theory, 51-53 
Shape parameters, 254-255, 432
Shapiro-Wilk test, 256 
Siegel-Tukey test, 888 
Signed-rank interval, 867 
Signed ranks, 862 
Significance 
practical, 553-554, 836
statistical, 554, 571, 836 
Significance level, 508 
definition of, 508 
joint distribution and, 552 
likelihood ratio and, 542 
observed, 533 
Sign interval, 860 
Sign test, 858, 860 
Simple events, 51, 61, 66 
Simple hypothesis, 542, 829 
Simple random sample, 6 
definition of, 6, 359 
independence in, 359 
sample size in, 359 
Simulation experiment, 359, 
363-366, 491, 538 
Single-classification, 639 
Single-factor, 639 
Skewed data 
coefficient of skewness, 138, 211 
definition of, 17 
in histograms, 17, 487 
mean vs. median in, 28 
measure of, 138 
probability plot of, 253, 486-487 
Skewness coefficient, 138 
Slope, 706-707, 715, 728, 730, 808 
Slope coefficient, 706 
confidence interval for, 730 
definition of, 706-707
hypothesis tests for, 732 
least squares estimate of, 714 
in logistic regression model, 808 
Standard beta distribution, 244 
Standard deviation, 132, 205 
normal distribution and, 213 
of point estimator, 400-402 
population, 133, 205 
of a random variable, 133, 205 
sample, 32 
z table and, 218 
Standard error, 150, 400-402 
Standard error of the mean, 369 
Standard gamma distribution, 231 
Standardized residuals, 758 
Standardized variable, 218 
Standard normal distribution, 214 
Cauchy distribution and, 342 
chi-squared distribution and, 381 
critical values of, 217 
definition of, 214 
density curve properties for, 
214-217 
F distribution and, 385, 387 
percentiles of, 215-217 
t distribution and, 387, 391 
Standard normal random variable, 
214, 387 
Statistic, 359
Statistical hypothesis, 501 
Stem-and-leaf display, 9-11 
Step function, 121 
Stratified samples, 6 
Studentized range distribution, 654 
Student t distribution, 383
Sufficient, 429 
Sufficient statistic(s), 429, 430, 
432-436, 443, 444, 446, 447 
Summary statistics, 715, 719, 731, 
753 
Sum of squares 
error, 643, 718, 722 
interaction, 690 
lack of fit, 766 
pure error, 766 
regression, 723, 797 
total, 677, 734, 773 
treatment, 644-648 
Support, 116, 191 
Symmetric, 17, 200 
Symmetric distribution, 17, 138, 200 


T 
Taylor series, 207, 667 
t confidence interval 
heavy tails and, 867, 869, 876
in linear regression, 730, 738 
in multiple regression, 776 
one-sample, 463-465 
paired, 594-596 
pooled, 582 
two-sample, 575, 596 
t critical value, 463 
t distribution 
central, 497 
chi-squared distribution and, 391, 
579, 590 
critical values of, 384, 463, 524, 
536 
definition of, 390 
degrees of freedom in, 390, 475 
density curve properties for, 384, 
463 
F distribution and, 663 
noncentral, 497 
standard normal distribution and, 
383, 384, 464 
Student, 383-385 
Test of hypotheses, 502 
Test statistic, 503, 504 
Third quartile, 36 
Time series, 48, 757 
Tolerance interval, 469 
Total sum of squares, 644, 720 
Transformation, 167, 258-262, 
336-341 


Treatment, 640, 642-643, 672, 673 
Treatment sum of squares SSTr, 643 
Tree diagram, 67-68, 79, 82, 89 
Trial, 144-147 
Trimmed mean, 29 
definition of, 29
in order statistics, 342-343 
outliers and, 29 
as point estimator, 398 
population mean and, 406, 409 
Trimming proportion, 29, 409 
True (or population) regression 
coefficients, 767 
True (or population) regression 
function, 767 
True regression line, 707-709, 713, 
727-728 
t test 
vs. F test, 664 
heavy tails and, 867, 869, 876
likelihood ratio and, 547, 548 
in linear regression, 734 
in multiple regression, 775-777, 
822 
one-sample, 463-465, 536, 
547-549, 592, 875 
paired, 592 
pooled, 581-582, 664 
P-value for, 536-537 
two-sample, 575-578, 596, 664 
type I error and, 517-519, 576 
type II error and, 517-520, 582 
vs. Wilcoxon rank-sum test, 875 
vs. Wilcoxon signed-rank test, 
866-867 
Tukey's procedure, 654-659, 666, 
679-680, 691 
Two one-sided tests, 637 
Two-proportion z interval, 606 
Two-sample t confidence interval for
μ1 − μ2, 575
Two-sample t test, 575
Two-way contingency table, 840 
Type I error, 504 
definition of, 544 
Neyman-Pearson theorem and,
543 
power function of the test and, 
545 
P-value and, 532-533 
sample size and, 516 
significance level and, 508 
vs. type II error, 508 
Type II error, 504 
definition of, 504 
vs. type I error, 508 
Type II error probability 
in ANOVA, 663-665, 699 
degrees of freedom and, 597 
for F test, 663-665, 686

in linear regression, 736 

Neyman-Pearson theorem and,
542-545 

power of the test and, 546 

sample size and, 515, 549-550, 
554, 580, 582 

in tests concerning means, 515 

in tests concerning proportions, 
527-528, 605-606 

t test and, 582 

vs. type I error probability, 508 

in Wilcoxon rank-sum test, 875 

in Wilcoxon signed-rank test, 
866-867 


U 
Unbiased estimator, 400-410 


minimum variance, 406-408 


Unbiased tests, 547 
Uncorrelated, 300 
Uncorrelated random variables, 300, 


304 


Uniform distribution, 192 




beta distribution and, 893 
Box—Muller transformation and, 
342 
definition of, 192 
discrete, 135 
transformation and, 260-261 
Uniformly most powerful
(UMP) level α test, 547
Uniformly most powerful test,
545-546
Unimodal histogram, 17-18
Union-intersection test, 637
Union of events, 51
Univariate data, 2
Upper confidence bound, 460
Upper confidence bound for μ, 464
Upper quartile, 36


V
Variable(s), 2 


covariate, 790 

in a data set, 9 
definition of, 1 
dependent, 704 
dummy, 787-789 
explanatory, 704 
independent, 704 
indicator, 787-789 
predictor, 704 
random, 110 
response, 704 


Variable utility test, 775 
Variance, 132, 205 
conditional, 319-321
confidence interval, 481-484 
of a function, 134-135, 207-208, 
355 
of a linear function, 134-137, 
305 
population, 133, 205 
precision and, 895 
of a random variable, 132, 205 
sample, 32-36 
Variances, comparing two, 611-616 
Venn diagram, 53, 60, 76, 77 


W
Walsh averages, 868 
Weibull distribution, 239 
basics of, 239-243 
chi-squared distribution and, 274 
estimation of parameters, 422, 
425-426 


extreme value distribution and, 
253 
probability plot, 253-254 
Weighted average, 127, 203, 323, 
582, 895 
Weighted least squares, 763 
Weighted least squares estimates, 
763 
Wilcoxon rank-sum interval, 876 
Wilcoxon rank-sum test, 871-875 
Wilcoxon signed-rank interval, 869 
Wilcoxon signed-rank test, 861-867 


Z 
z confidence interval 
for a correlation coefficient, 753 
for a difference between means, 
581 
for a difference between 
proportions, 609 
for a mean, 456
for a proportion, 475 
z critical values, 217
z curve 
area under, maximizing of, 552 
rejection region and, 513 
t curve and, 384 
z test 
chi-squared test and, 850 
for a correlation coefficient, 753 
for a difference between means, 
565-581 
for a difference between 
proportions, 603 
for a mean, 514, 519 
for a Poisson parameter, 462, 561 
for a proportion, 527 
P-value for, 534 


